Netizen: Large World Models


Saturday, June 06, 2026

World Labs' Fei-Fei Li on Creating Large World Models: The Next Frontier Beyond Language

Fei-Fei Li, often called the “Godmother of AI” for her foundational work on ImageNet and computer vision, has shifted focus from static image understanding to dynamic, embodied intelligence. As co-founder and CEO of World Labs (founded in 2024 with Justin Johnson, Christoph Lassner, and Ben Mildenhall), she is championing Large World Models (LWMs) as the path to spatial intelligence—AI that perceives, generates, reasons about, and interacts with the 3D (and 4D, time-inclusive) physical and virtual world. 
In a June 2026 Bloomberg Tech interview with Emily Chang, Li articulated a vision rooted in evolutionary biology: animal intelligence began with seeing and moving in the physical world ~500 million years ago. Human cognition, creativity, work, and daily life heavily rely on perceiving, understanding, reasoning about, and interacting with spatial environments, far beyond what words alone can capture.
What Are Large World Models?
Li contrasts LWMs with Large Language Models (LLMs). LLMs excel at language but are “wordsmiths in the dark,” lacking grounding in geometry, physics, dynamics, or causality. World models operate on pixels, voxels, or 3D representations to build internal simulations of the world.
In her Substack essay “A Functional Taxonomy of World Models,” Li and the World Labs team outline three core functional types (with overlap and convergence as the goal): 
Renderer: Focuses on visual fidelity for human consumption (e.g., text-to-video models). Generates beautiful pixels but often lacks true 3D consistency, physics, or geometric accuracy. Outputs may look good from one view but break under navigation or interaction.
Planner: Optimized for machines/robots. Takes world state as input and outputs actions or next steps. Common in current robotics but may lack rich generative or simulation depth.
Simulator (the linchpin): Consumed by both humans and machines. Respects structure, physics, dynamics, semantics, and 3D/4D information. It can render views and support planning. This unified capability enables persistent, editable, interactive worlds. 
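One way to read the taxonomy above is as a set of interfaces that differ in who consumes the output. The sketch below is purely illustrative: `WorldState`, `Renderer`, `Planner`, and `ToySimulator` are hypothetical names invented for this example and are not World Labs' API. It shows a toy 1D world in which a single simulator object can step the state forward while also serving the renderer and planner roles.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class WorldState:
    """Toy 1D world (hypothetical): an agent cell and a goal cell."""
    agent: int
    goal: int


@runtime_checkable
class Renderer(Protocol):
    """Produces output meant for human eyes."""
    def render(self, state: WorldState) -> str: ...


@runtime_checkable
class Planner(Protocol):
    """Produces output meant for machines: the next action."""
    def plan(self, state: WorldState) -> int: ...


class ToySimulator:
    """Keeps the state consistent over time and serves both consumers."""

    def step(self, state: WorldState, action: int) -> WorldState:
        # Dynamics: the agent moves by `action` cells, the goal stays put.
        return WorldState(state.agent + action, state.goal)

    def render(self, state: WorldState) -> str:
        cells = ["."] * 5
        cells[state.goal] = "G"
        cells[state.agent] = "A"
        return "".join(cells)

    def plan(self, state: WorldState) -> int:
        # Move one cell toward the goal: returns -1, 0, or +1.
        return (state.goal > state.agent) - (state.goal < state.agent)


def rollout(sim: ToySimulator, state: WorldState, max_steps: int = 10) -> WorldState:
    """Plan, act, and re-simulate until the planner has nothing left to do."""
    for _ in range(max_steps):
        action = sim.plan(state)
        if action == 0:
            break
        state = sim.step(state, action)
    return state
```

The design point is only that the simulator owns the state transition (`step`), so views and plans derived from it stay consistent with each other.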
Marble, World Labs’ first commercial multimodal world model (generally available since November 2025), generates high-fidelity, persistent, spatially consistent 3D environments from text, images, video, panoramas, or spatial sketches. It uses Gaussian splats for visuals plus collision meshes for physics/interaction, supporting editing, exploration, camera control, and use across devices (phones, laptops, VR).
It targets creative workflows (games, film, design) while laying groundwork for broader applications. 
Li emphasizes that true world models must be generative (create diverse, consistent worlds from prompts), support multimodal interaction, enable reasoning and prediction over time, and bridge perception to action.
Spatial Intelligence, Space-Time, Physical AI, and AGI for Robots
This is fundamentally about spatial intelligence: the scaffold of cognition that allows understanding depth, navigation, object interactions, mental rotation, physics prediction, and causal reasoning in 3D space. It extends to space-time intelligence by modeling dynamics and change over time (4D).
It directly supports Physical AI and embodied agents (robots, avatars). LLMs alone cannot reliably “put down fires, cook an omelet,” design physical spaces, or enable robots to act fluently in unstructured environments. World models provide the simulation layer for training, planning, and optimization—digital twins, industrial design, inspection, and real-world robotics. 
Regarding AGI for robots: Li sees this as a critical enabler for the robotics revolution, but she is measured. Humanoids have attracted billions in funding, yet practical tasks (e.g., loading a dishwasher quickly or fetching packages) remain challenging. World models are a foundational technology to close the “sim-to-real” gap and bridge hype to reality, but progress requires sustained, thoughtful investment beyond current LLM-scale efforts. Robotics investment so far is modest compared to self-driving or language models. 
Li views spatial intelligence as complementary to language models in multimodal systems, not a replacement. Together, they enable richer AI that understands both “words” and “worlds.”
Multiple Angles: Opportunities, Challenges, and Broader Implications
Creative and Professional Applications: Interior design, architecture, gaming, film VFX, storytelling. Users could prompt with text or photograph a space and iteratively edit it (change curtain colors, redesign layouts) with persistence and physical plausibility. Professional creators, designers, engineers, and researchers benefit first.
Industrial and Scientific: Digital twins for manufacturing/optimization, robotics training/simulation, healthcare (e.g., surgical planning), scientific visualization (e.g., molecular or environmental modeling).
Societal and Philosophical: Grounding AI in the physical world could make it more useful, trustworthy, and aligned with human needs. Li stresses building technology that augments humanity—empowering rather than harming—through careful data choices, evaluations, guardrails, and responsible development. She critiques hype-driven “safety theater” in favor of scientifically grounded work already happening in fields like healthcare. 
Challenges: Data scarcity for 3D/physics/robotics (vs. abundant video for renderers); balancing visual beauty with physical precision; computational demands; reconciling renderer/planner/simulator capabilities in unified models; ethical/safety considerations as systems become more capable. 
Li notes the field is young. World Labs gained an early lead by focusing on this post-LLM frontier, but competitors (e.g., Google’s Genie, Nvidia’s efforts) are advancing. A “ChatGPT moment” may look different: more enterprise/professional adoption initially than viral consumer chat.
Where We Stand Now (Mid-2026)
World Labs has raised significant funding (reported rounds range from hundreds of millions of dollars to $1B+) with investors including Nvidia, Autodesk, and others. Marble is commercially available (freemium/paid).
Progress includes consistent 3D generation and basic interactivity, but full physics-aware simulation, long-horizon reasoning, and robust sim-to-real transfer for advanced robotics are still evolving.
Broader ecosystem: Rapid activity in video/world generation models, but many remain renderer-focused or lack full spatial grounding. Li’s team published a taxonomy to clarify the space. 
What to Expect in the Next Few Years
Short-term (1–2 years): Improved Marble iterations with better physics, multimodality, real-time performance, and integration into design/robotics tools. More professional adoption in creative industries and early digital twins. Hybrid LLM + world model systems for richer interfaces.
Medium-term (3–5 years): Unified models bridging rendering, simulation, and planning. Significant advances in robot learning via high-fidelity simulation. Practical applications in manufacturing, logistics, healthcare assistance, and immersive entertainment/VR/AR. Consumer tools for personal design/spatial creativity.
Longer-term: Spatial intelligence as a core pillar of more general AI, enabling fluent physical-world agents. Transformative impact on productivity, creativity, and human-AI collaboration—potentially rivaling or exceeding the LLM wave in real-world utility.
Li’s vision is ambitious yet grounded: spatial intelligence is not just the next technical step but a return to the foundations of intelligence itself. By building machines that truly understand and create worlds, not just words, AI can move from impressive pattern-matching to genuine utility and partnership in the physical (and virtual) realities that define human experience. 
World Labs and similar efforts signal a pivotal shift. The coming years will test how quickly these models scale, generalize, and integrate—potentially reshaping industries and our relationship with AI.

Exploring World Labs' Marble: Capabilities of the Multimodal World Model 
Marble is World Labs' first commercial product, generally available since November 12, 2025. It is a frontier multimodal world model designed to generate, reconstruct, edit, and simulate high-fidelity, persistent, and spatially consistent 3D worlds. Powered by World Labs' spatial intelligence research, it turns 2D inputs (text, images, video) into explorable 3D environments that support navigation, interaction, editing, and export.
Unlike many video or image generators (often "renderers" focused on visual beauty), Marble emphasizes geometric consistency, persistence, and usability across human and machine applications. Worlds are built with technologies like 3D Gaussian Splatting (for visuals) and support collision meshes/physics for interaction.
Core Input Capabilities (Multimodal Creation)
Marble supports diverse inputs for flexible workflows:
Text Prompts: Natural language descriptions generate complete 3D worlds. Examples include detailed scenes like "a sunlit stone castle courtyard" or whimsical environments.
Single Image: Turns a photo or AI-generated image into a full navigable 3D world. Great for lifting 2D concepts into 3D.
Multi-Image: Upload images for different views (e.g., Front, Back, Left, Right, or Auto Layout). Marble stitches them into a coherent 3D space with seamless transitions—ideal for precise control or real-world photo reconstruction.
Video: Short clips (under 100MB) provide rich spatial data, such as rotational views.
Panoramas (360°): Offers maximum layout control and spatial accuracy.
Coarse 3D Layouts (Chisel Tool): An AI-native 3D sculpting mode. Users block out structures with basic shapes (boxes, planes) or import assets. Add a text prompt for style/details. This decouples structure (layout) from style (visuals), enabling precise architectural or design control. 
Generation Options: Full world generation (~5 min) or faster "draft" modes (~20 sec). Generation times vary by complexity.
Editing and Iteration Tools
Marble stands out for its iterative, AI-native editing, moving beyond one-shot generation:
Pano Edit: Edit via panoramic views. Select areas and describe changes in natural language (e.g., "change counters to black granite" or "turn back wall into a stage").
Click and Expand: Grow worlds by clicking unexplored areas for seamless extensions.
Variations: Generate alternative versions while preserving core elements.
Studio Tools: Advanced composition—combine multiple worlds into larger environments (e.g., game maps or building complexes). Record cinematic camera paths and flythrough videos. 
This supports highly iterative creative processes common in design, film, and game development.
Outputs and Export Options
Generated worlds are persistent and explorable in-browser (desktop/mobile, with VR support). Exports include:
Gaussian Splats (for high-fidelity visuals).
Meshes (with collision for physics/interaction).
Videos (from recorded paths).
Assets for game engines, DCC tools (Blender, Maya, etc.), 3D printing, or CAD.
Web/VR shareable links. 
Key Strengths and Use Cases
Spatial Consistency and Persistence: Worlds maintain coherence when navigating or editing, reducing "morphing" issues common in some generative models.
Creative Industries: Game development, VFX/film (set design, storyboarding), architecture/interior design, immersive storytelling, concept art.
Robotics and Simulation: Rapid generation of diverse, photorealistic training environments with physics support. Integrates with tools like Isaac Sim, MuJoCo, Omniverse. Enables "described datasets" for scalable robot training. 
Professional Workflows: Digital twins, product visualization, education, scientific visualization.
Accessibility: Works across devices; community gallery for inspiration and sharing. 
Limitations (as of mid-2026): Generation can take minutes for full worlds; some advanced editing is desktop-preferred; physical simulation depth continues to evolve; the credit-based system limits heavy free-tier use. Marble is not fully real-time interactive like some research demos; it prioritizes persistence and quality instead.
Pricing and Access
Free Tier: Limited generations (e.g., ~4 per month from basic inputs).
Paid Tiers (Standard ~$20/mo, Pro ~$35/mo with commercial rights, Max ~$95/mo): More credits, advanced features (multi-image/video, expansion, priority), higher limits. 
World API: Launched January 2026 for developers to integrate Marble capabilities programmatically (credit-based). 
Marble Labs (creative hub) offers tutorials, case studies, and community showcases.
Current Standing and Outlook
Marble represents a strong early step in commercial Large World Models, emphasizing controllability and utility over pure spectacle. It has seen model updates (e.g., Marble 1.1 for better lighting and fewer artifacts, 1.1-Plus for larger scenes).
In the next few years, expect tighter integration with LLMs for natural interfaces, improved real-time performance/physics, broader robotics applications, and more seamless AR/VR deployment. It positions World Labs (and spatial intelligence) as a key pillar alongside language models for practical, embodied AI. 
To explore hands-on, visit marble.worldlabs.ai. Marble democratizes 3D world creation, turning imagination (or a quick photo/prompt) into persistent, editable digital realities.

NVIDIA Omniverse: The Platform for Physical AI, Digital Twins, and Simulation 
NVIDIA Omniverse is a modular platform of accelerated libraries, microservices, SDKs, and APIs built on OpenUSD (Universal Scene Description) for developing and operating 3D applications, physically accurate simulations, industrial digital twins, and agentic workflows for Physical AI. It has evolved from a real-time 3D collaboration and visualization tool into what NVIDIA positions as the "operating system" for the physical AI era and industrial digitalization.
Core Architecture and Key Components
OpenUSD Foundation: Enables seamless data interoperability across 50+ formats and tools. It serves as the universal language for composing, aggregating, and simulating complex 3D scenes and digital twins.
Nucleus: Database and collaboration engine for real-time data exchange, version control, authentication, and multi-user workflows across applications. 
Kit SDK: Framework for building custom OpenUSD-based applications, microservices, or extensions (in Python or C++). Supports headless deployment for scalable services. 
RTX Rendering & Neural Rendering (NuRec): Physically based, real-time ray/path tracing for photorealistic visualization, sensor simulation, and high-fidelity synthetic data. Supports Gaussian splats and neural volumes. 
Physics & Simulation: GPU-accelerated PhysX (and integrations like Newton) for rigid/soft bodies, fluids, vehicles, and complex interactions. Critical for robotics and industrial sims. 
Connectors & Exchange: Plugins for major DCC tools (e.g., Autodesk, Adobe, Siemens, PTC Onshape) to import/export USD data bidirectionally. 
Replicator & Synthetic Data: Generates labeled, physically accurate datasets with domain randomization for training perception models in robotics, AV, and vision AI. 
Agent Skills & APIs: Recent (2026) open-source tools turn complex simulation workflows into callable agent tasks. Cloud APIs (USD Render, Query, etc.) and integration with Cosmos World Foundation Models for generative capabilities. 
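The synthetic-data component above relies on domain randomization: scene parameters are sampled at random so that a perception model sees wide variation, while ground-truth labels come for free because the simulator placed every object itself. The sketch below illustrates only that idea in plain Python. It does not use the Omniverse Replicator API, and every name, parameter, and value range in it (`SceneSample`, `sample_scene`, the light and camera ranges, the object and texture lists) is an assumption made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneSample:
    """One randomized scene configuration (all fields are illustrative)."""
    light_intensity: float   # arbitrary units
    camera_height_m: float   # metres above the floor
    object_yaw_deg: float    # rotation of the object about the vertical axis
    texture: str             # surface material applied to the object
    label: str               # ground truth, known because we placed the object


OBJECTS = ["pallet", "box", "forklift"]
TEXTURES = ["wood", "metal", "plastic"]


def sample_scene(rng: random.Random) -> SceneSample:
    """Draw every nuisance parameter independently from a broad range."""
    return SceneSample(
        light_intensity=rng.uniform(200.0, 2000.0),
        camera_height_m=rng.uniform(0.5, 3.0),
        object_yaw_deg=rng.uniform(0.0, 360.0),
        texture=rng.choice(TEXTURES),
        label=rng.choice(OBJECTS),
    )


def make_dataset(n: int, seed: int = 0) -> list[SceneSample]:
    """Seeded so that a dataset can be regenerated exactly."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]
```

In a real pipeline each `SceneSample` would drive a renderer that produces the image paired with the label; here the configuration itself stands in for that step.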

Isaac Sim (built on Omniverse) is the flagship robotics simulation framework, supporting URDF/MJCF import, ROS 2 integration, sensor simulation, and pairing with Isaac Lab for reinforcement learning and large-scale training.
Capabilities and Strengths
Real-time Collaboration: Multiple users and tools work simultaneously in shared virtual spaces.
High-Fidelity Simulation: Physics-accurate environments for testing robots, factories, or vehicles before physical deployment (sim-to-real transfer). 
Synthetic Data & AI Training: Scales generation of diverse datasets, reducing reliance on costly real-world data. 
Digital Twins: Photorealistic, live-connected virtual replicas for optimization, planning, and monitoring (e.g., factories, warehouses). 
Scalability: From desktop workstations to multi-GPU servers, cloud (DGX Cloud, AWS/Azure), and streaming (including to Apple Vision Pro). 
Extensibility: 600+ extensions, blueprints for common workflows, and open integration points. 

Integration with World Labs Marble: Marble-generated scenes (Gaussian splats + collider meshes) can be converted via Omniverse NuRec and imported into Isaac Sim for robotics training, combining generative world creation with high-fidelity physics simulation.
Major Use Cases and Industries
Robotics & Physical AI: Train humanoids and manipulators (e.g., with GR00T, Isaac Lab). Partners include Apptronik, Figure, FANUC, ABB, Skild AI. Sim-to-real successes in manipulation, navigation, and fleet coordination.
Manufacturing & Industrial Digital Twins: BMW (31 factories), Siemens, Foxconn, Amazon Robotics (500k+ robots). 30-70% efficiency gains reported in planning, reconfiguration, and operations. 
Automotive & AV: Factory simulation, vehicle design, marketing content, autonomous driving validation (e.g., Nissan, GM). 
Architecture, Media & Entertainment: Real-time design reviews, virtual production, immersive experiences. 
Scientific & Other: CAE, molecular visualization, training/simulation for various domains. 
Where We Stand (Mid-2026)
Omniverse has broad enterprise adoption (252+ companies, 300k+ downloads as of late 2025) and is deeply integrated into Physical AI stacks via Isaac, Cosmos, and agent tools. Recent releases emphasize agentic workflows, open-source skills, and tighter robotics integration. It excels in enterprise/industrial settings with strong physics, interoperability, and scalability, though it can have a steeper learning curve for pure creative users compared to some generative tools.
Vs. World Labs Marble: Marble focuses on fast, generative creation of persistent 3D worlds (text/image/video-to-3D with editing). Omniverse/Isaac Sim provides the robust simulation, physics, sensor accuracy, and deployment layer for professional and robotics use. They are complementary: Marble scenes feed into Omniverse workflows.
Access, Pricing, and Getting Started
Free/Individual: Core tools, Isaac Sim (open-source reference), limited collaboration.
Enterprise: Subscription-based (Creator ~$2k/user/year, Reviewer lower, Nucleus separate). Starts with minimums; includes support and advanced features. Cloud options available. Trials exist. 
Developers: Download via NGC Catalog, use Kit SDK, blueprints, and learning paths (OpenUSD, Digital Twins, Robotics). 

Expectations Ahead: Deeper generative AI (text-to-sim, Cosmos integration), broader agentic AI for autonomous operations, improved real-time performance on new hardware (Blackwell-era), and expanded ecosystem for AR/VR/spatial computing. It will likely remain central to closing sim-to-real gaps in robotics and scaling industrial AI. 
Omniverse represents NVIDIA's bet on simulation as the foundation for Physical AI. For creators, it's a powerful interoperability and rendering hub; for enterprises and roboticists, it's a production-grade platform driving measurable ROI in the physical world. Explore at nvidia.com/omniverse. 

3D Gaussian Splatting (3DGS): The Breakthrough in Real-Time 3D Scene Representation
3D Gaussian Splatting is a rasterization-based technique for representing and rendering photorealistic 3D scenes from a sparse set of 2D images. Introduced in a seminal 2023 SIGGRAPH paper by researchers from Inria (Kerbl et al.), it has rapidly become a cornerstone of novel view synthesis, digital twins, spatial AI, and immersive media.
Unlike traditional polygon meshes or implicit neural representations like Neural Radiance Fields (NeRFs), 3DGS uses an explicit collection of millions of anisotropic 3D Gaussians (ellipsoidal "splats") that can be efficiently projected ("splatted") onto the 2D image plane for real-time rendering.
How 3D Gaussian Splatting Works
Initialization: Start with a sparse point cloud from Structure-from-Motion (SfM) tools like COLMAP, using multi-view images with known camera poses.
Gaussian Primitives: Each point becomes a 3D Gaussian defined by:
Position (mean/center in 3D space).
Covariance matrix (decomposed into scaling and rotation for shape and orientation — anisotropic, not spherical).
Opacity (alpha/transparency).
Color (often via spherical harmonics for view-dependent effects like specular highlights). 
Optimization: Differentiable rasterization + stochastic gradient descent (no heavy neural networks needed) minimizes a loss combining pixel color differences (L1 + D-SSIM perceptual loss). Adaptive density control adds, clones, or prunes Gaussians during training. 
Rendering: Fast, visibility-aware GPU rasterization (tile-based sorting and alpha blending). Projects 3D Gaussians to 2D, sorts by depth, and composites — achieving 100+ FPS at high resolutions on consumer GPUs. 
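Two of the steps above are compact enough to sketch directly: building an anisotropic covariance from a scale vector and a rotation, and compositing depth-sorted splats at one pixel. The code below is a minimal pure-Python illustration, not the reference implementation. It assumes a (w, x, y, z) unit-quaternion convention for the rotation, and it treats each splat's per-pixel alpha as already computed (in a real rasterizer it would be the opacity multiplied by the projected 2D Gaussian falloff). The function names are hypothetical.

```python
import math


def quat_to_rot(q):
    """Rotation matrix from a quaternion given as (w, x, y, z)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(scale, quat):
    """Sigma = R S S^T R^T with S = diag(scale); symmetric PSD by construction."""
    R = quat_to_rot(quat)
    return [
        [sum(R[i][k] * scale[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def composite(splats):
    """Front-to-back alpha blending at a single pixel.

    splats: iterable of (depth, alpha, (r, g, b)) for splats covering the pixel.
    Returns the blended colour and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for _, alpha, rgb in sorted(splats, key=lambda s: s[0]):  # nearest first
        weight = alpha * transmittance
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return color, transmittance
```

Parameterizing the covariance through scale and rotation, rather than optimizing its six free entries directly, keeps it a valid covariance matrix throughout gradient descent.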
This explicit, discrete representation avoids the slow ray-marching of NeRFs while preserving volumetric continuity through overlapping soft ellipsoids.
Comparison: 3DGS vs. NeRF vs. Traditional Photogrammetry
Vs. NeRFs: NeRFs use implicit MLPs for continuous radiance fields, which gives high quality but slow training and rendering (seconds to minutes per view). 3DGS is explicit, trains faster (minutes vs. hours/days), renders in real time, and often matches or exceeds NeRF visual quality with fewer artifacts in many scenes. However, NeRFs can model complex lighting more naturally in some cases.
Vs. Photogrammetry/Meshes: Meshes excel at geometric accuracy, measurements, and integration into traditional pipelines (e.g., CAD, games). 3DGS captures fine details, transparency, reflections, and volumetric effects better but is less structured for precise engineering measurements. 
Strengths of 3DGS:
Real-time performance (30–100+ FPS).
High photorealism from sparse inputs.
Efficient training and lower compute needs.
Editable explicit primitives (move, delete, or modify individual splats). 
Compact storage and fast loading.
Limitations (Mid-2026):
Can produce "floaters" or artifacts in under-constrained areas.
Geometric accuracy sometimes lags meshes for metrology.
Large scenes require memory management or tiling.
Dynamic scenes need extensions (4D-GS).
Limited native support in some authoring tools (though improving rapidly). 
Recent Advances and Extensions
4D Gaussian Splatting: Adds time dimension for dynamic scenes (moving objects, people) with real-time capabilities.
Super-resolution, compression, and scalability variants for larger environments or lower-input resolutions. 
Physics integration: Collision meshes, interaction with rigid bodies. 
Hybrid systems: Combine with meshes, neural components, or ray tracing. 
Applications and Ecosystem
Creative Industries: Film VFX, game development, virtual production, architecture visualization. Real-time novel views from photos.
World Labs Marble: Uses Gaussian splats (plus collision meshes) for persistent, editable 3D worlds generated from text/images/video. Exports splats for high-fidelity rendering. 
NVIDIA Omniverse / Isaac Sim: NuRec libraries support RTX ray-traced 3DGS for large-scale reconstruction, digital twins, and robotics simulation. Enables photorealistic sensor data and sim-to-real transfer. 
Robotics & Physical AI: Generate diverse training environments, synthetic data, and realistic perception for humanoids/AVs. 
Other: Cultural heritage, spatial journalism, AR/VR, medical visualization, and consumer 3D capture. 
Where We Stand and Future Outlook
As of mid-2026, 3DGS has matured into production tools with plugins for Unreal Engine, Unity, Omniverse, and web viewers (e.g., Spark). It is often described as a "JPEG moment" for spatial computing: fast, high-quality, and democratizing 3D capture.
Next Few Years: Expect tighter integration with generative AI (text-to-splat), better physics/dynamics, standardized formats, mobile/edge optimization, and hybrid explicit-implicit models. Challenges like scalability for city-scale scenes and full bidirectional editing with traditional 3D pipelines will drive innovation. 
3D Gaussian Splatting has shifted the field from slow, implicit neural rendering toward fast, editable, explicit representations — a foundational technology bridging photogrammetry, AI generation, simulation, and real-time graphics for the spatial and physical AI eras. Open implementations and tools are widely available for experimentation.

4D Gaussian Splatting (4DGS): Extending 3DGS into Dynamic, Time-Evolving Scenes 
4D Gaussian Splatting builds directly on the success of 3D Gaussian Splatting by adding a temporal (time) dimension. While 3DGS excels at static scenes with real-time photorealistic rendering from images, 4DGS reconstructs and renders dynamic scenes — objects moving, deforming, appearing/disappearing, or interacting over time — while preserving high speed, quality, and efficiency. 
Pioneered in works like the CVPR 2024 paper "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering" (often called 4D-GS), it has seen rapid advancements, with variants achieving 30–1000+ FPS in optimized cases.
Core Concepts and How It Works
4DGS represents a dynamic scene as a collection of 4D Gaussian primitives (or deformed 3D Gaussians over time) instead of millions of independent 3D Gaussians per frame. Key approaches include:
Deformation Fields: A common method maintains a set of canonical (reference) 3D Gaussians and uses a lightweight neural network (often MLP) + spatio-temporal encoders (e.g., inspired by HexPlane or 4D neural voxels) to predict how each Gaussian moves, rotates, scales, or changes opacity/color at any timestamp t. This avoids storing full per-frame data. 
Native 4D Primitives: Some variants use full 4D Gaussians (anisotropic in XYZT space) with 4×4 covariance matrices, allowing direct modeling of space-time ellipsoids. Temporal slicing projects them into 3D at a given time. 
Optimization: Trained on multi-view video (synchronized cameras ideal) or even monocular video with priors. Uses similar differentiable rasterization as 3DGS, plus temporal losses for consistency. Adaptive density control handles birth/death of elements. 
Rendering: Projects the time-specific 3D Gaussians via fast splatting and alpha blending, enabling real-time novel view synthesis of dynamic content (e.g., 82 FPS at 800×800 on RTX 3090 in early models; much higher in optimized 2025+ variants). 
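The "temporal slicing" step for native 4D primitives can be illustrated with the standard formula for conditioning a multivariate Gaussian on one coordinate. The sketch below is one common formulation under stated assumptions, not any particular paper's code: the mean is ordered (x, y, z, t), the 4×4 covariance is given as nested lists, and the hypothetical function `slice_4d_gaussian` returns the 3D mean and covariance at time t plus a temporal weight that would scale the splat's opacity.

```python
import math


def slice_4d_gaussian(mean, cov, t):
    """Condition a 4D Gaussian over (x, y, z, t) on a fixed time t.

    mean: [mx, my, mz, mt]
    cov:  4x4 symmetric positive-definite matrix (nested lists)
    Returns (mean3, cov3, weight) where weight is the unnormalized
    temporal marginal exp(-0.5 * (t - mt)^2 / var_t).
    """
    var_t = cov[3][3]
    dt = t - mean[3]
    cross = [cov[i][3] for i in range(3)]  # space-time covariance column

    # Conditional mean: the centre drifts linearly with time.
    mean3 = [mean[i] + cross[i] / var_t * dt for i in range(3)]

    # Conditional covariance (Schur complement): independent of t.
    cov3 = [
        [cov[i][j] - cross[i] * cross[j] / var_t for j in range(3)]
        for i in range(3)
    ]

    # The primitive fades in and out around its temporal centre.
    weight = math.exp(-0.5 * dt * dt / var_t)
    return mean3, cov3, weight
```

With this formulation a single 4D primitive encodes linear motion (through the space-time cross terms) and a finite lifespan (through the temporal variance), which is why short-lived Gaussians can be pruned for efficiency.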
This results in volumetric video: photorealistic, free-viewpoint playback of moving scenes.
Strengths and Advantages
Real-Time Performance: Dramatically faster than dynamic NeRF variants; supports interactive playback and VR/AR.
High Fidelity: Captures complex motions, deformations, reflections, and fine details (e.g., flowing water, people talking, flickering flames).
Efficiency: Lower storage than per-frame 3DGS; faster training (often minutes to hours for a scene).
Editability: Explicit primitives allow manipulation of dynamics, though more challenging than static 3DGS.
Limitations:
Requires good multi-view input for best results (monocular is improving but harder).
Complex/long scenes can still be memory-intensive.
Physics integration or long-horizon prediction is ongoing research.
Artifacts in highly unconstrained areas or rapid motions. 
Applications and Ecosystem Integration
Creative & Media: Volumetric video for film VFX, virtual production, concerts, sports broadcasting, immersive storytelling.
Robotics & Physical AI: Dynamic environment simulation in Omniverse/Isaac Sim for training agents on moving objects/people. Integration of 4D splats for realistic sensor data. 
World Labs Marble: Primarily 3D-focused with persistent worlds, but supports dynamic elements and can benefit from 4D techniques for video inputs or animated outputs. Complementary to generative world creation. 
Other: Cultural heritage (reconstructing historical events), AR/VR experiences, medical (dynamic anatomy), and autonomous vehicles.
NVIDIA Omniverse and tools like NuRec increasingly support Gaussian-based reconstructions, with 4D extensions for simulation.
Current Standing (Mid-2026) and Future Outlook
By mid-2026, 4DGS has moved from research (hundreds of papers) to early production use in film, broadcasting, and simulation. Optimizations address temporal redundancy (e.g., pruning short-lifespan Gaussians) for higher speeds and scalability.
Next Few Years:
Tighter integration with generative AI (text/video-to-4D worlds).
Better monocular and casual capture support.
Hybrid physics-aware models for simulation-grade dynamics.
Standardized formats and broader tool support (Unity, Unreal, web viewers).
Consumer/volumetric video platforms for events and personal captures.
4D Gaussian Splatting bridges static 3D capture with true spatiotemporal intelligence, powering more lifelike digital twins, interactive media, and embodied AI training. It represents a key enabler for the shift toward dynamic, world-like AI models discussed in spatial intelligence contexts like World Labs. Open-source repos (e.g., 4DGaussians) and tools make it accessible for experimentation. 
This technology continues to evolve rapidly, promising increasingly seamless reconstruction of the living, moving world around us.

Neural Radiance Fields (NeRF): The Foundational Implicit Representation for Novel View Synthesis 
Neural Radiance Fields, introduced in the seminal 2020 paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" by Ben Mildenhall, Pratul Srinivasan, Matthew Tancik, and colleagues, revolutionized 3D scene reconstruction and novel view synthesis. NeRFs represent complex scenes as continuous volumetric functions learned by a neural network, enabling photorealistic rendering from new viewpoints using only a sparse set of 2D input images.
How NeRFs Work
A NeRF models a scene as a fully connected multilayer perceptron (MLP) that takes a 5D input, consisting of a 3D spatial location (x, y, z) plus a 2D viewing direction (θ, φ), and outputs:
Volume density (σ): How much light is absorbed or scattered at that point.
View-dependent emitted radiance (color): RGB values that can change with viewing angle (capturing effects like reflections and specular highlights). 
Rendering Process (Volume Rendering):
Cast rays from a virtual camera through each pixel.
Sample points along each ray.
Query the MLP for density and color at each sample.
Accumulate values using classical volume rendering equations (alpha compositing) to produce the final pixel color. 
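The accumulation step above is usually written as a discrete quadrature: each sample contributes its colour weighted by its own opacity and by the transmittance of everything in front of it. The sketch below is a minimal pure-Python version for a single ray. The densities and colours are passed in as plain lists standing in for MLP outputs, and `render_ray` is a hypothetical name chosen for this example.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray, ordered near to far.

    sigmas: volume density at each sample
    colors: (r, g, b) emitted at each sample
    deltas: distance between each sample and the next
    Returns the pixel colour and the per-sample weights.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    weights = []
    for sigma, rgb, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, weights
```

Because every operation here is differentiable with respect to the densities and colours, the rendered pixel can be compared to a ground-truth pixel and the error backpropagated into the network that produced them.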
Training optimizes the MLP by minimizing the difference between rendered and ground-truth images, typically using positional encoding to handle high-frequency details.
Key Strengths
Photorealism: Excels at capturing intricate geometry, lighting, transparency, reflections, and view-dependent effects that traditional photogrammetry or meshes often struggle with.
Continuous Representation: Implicit and smooth; no discrete geometry needed.
Novel View Synthesis: Generates convincing images from angles not in the training set.
Applications: Virtual production/VFX, digital twins, cultural heritage, robotics simulation, AR/VR, medical visualization. 
Limitations (as of mid-2026)
Slow Training and Rendering: Original NeRFs required hours/days to train and seconds per frame to render, limiting real-time use.
Data Requirements: Performs best with dense, multi-view images under controlled conditions; struggles with sparse or casual captures.
Artifacts: Can produce floaters, aliasing, or poor performance on shiny/reflective surfaces or dynamic scenes.
Scalability: Memory-intensive for large scenes; less editable than explicit representations.
Physics/Geometry: Implicit nature makes precise measurements, collision, or export to traditional pipelines harder. 
Evolution and Optimizations
Many variants address core issues:
Instant-NGP (NVIDIA, 2022): Hash grid encoding + tiny MLPs for training in minutes and faster rendering. 
Dynamic/4D NeRFs: Extensions for moving scenes (though often slower than 4D Gaussian Splatting).
Hybrid Approaches: Combine with explicit elements for better efficiency.
By 2026, NeRFs remain influential in research and specific high-fidelity use cases, but 3D Gaussian Splatting (3DGS) has largely taken over practical applications due to superior speed (minutes training, 100+ FPS rendering), comparable or better quality in many scenes, and easier editing/export. 
NeRF vs. 3D Gaussian Splatting:
Representation: NeRF = implicit neural function; 3DGS = explicit anisotropic 3D Gaussians.
Speed: 3DGS wins decisively (real-time vs. slower NeRF).
Quality: NeRF can edge out in complex lighting; 3DGS often matches or exceeds with fewer artifacts in static scenes.
Use Cases: NeRF for maximum fidelity in offline rendering; 3DGS for interactive, production, and robotics pipelines. 
Integration with Broader Ecosystem
NVIDIA Omniverse / Isaac Sim: Supports NeRF-like neural volumes via NuRec libraries, alongside strong Gaussian Splatting integration. Used for photorealistic simulation and synthetic data.
World Labs Marble: Primarily leverages Gaussian Splats for persistent worlds but operates in the same radiance field / spatial intelligence space. Marble scenes (splats) convert easily into Omniverse for robotics training. 
Industry Adoption: Film/VFX (virtual production), architecture, robotics (environment reconstruction), and consumer apps (e.g., smartphone-to-3D).
Current Standing and Outlook (Mid-2026)
NeRFs sparked the radiance fields revolution and remain a foundational concept, but the field has shifted toward faster explicit or hybrid methods like Gaussian Splatting for most real-world deployments. Research continues on making NeRFs more efficient, generalizable, and suitable for dynamic/large-scale scenes.
Next Few Years: Expect tighter hybrids (NeRF + GS), better integration with generative AI (text-to-NeRF/GS), improved dynamics/physics awareness, and broader use in Physical AI training. NeRFs will likely persist in niches requiring ultimate volumetric fidelity, while explicit techniques dominate interactive and industrial applications.
NeRF transformed how we think about 3D from images — moving from discrete geometry to learned, continuous "intelligence" about light and space — paving the way for Large World Models and spatial AI. Open implementations and tools (e.g., via NVIDIA, academic repos) make it highly accessible for experimentation.
