The Direct from Imagination Era Has Begun | by Jon Radoff | Building the Metaverse

When I was a kid, I wanted to build a holodeck — the immersive 3D simulation system from Star Trek, so I started making games, beginning with online multiplayer games for bulletin board systems. Eventually, I even got to make a massively multiplayer mobile game based on Star Trek that a few million people played.

Although one feature of the holodeck — manipulating physical force fields — may remain the domain of science fiction — just about everything else is rapidly becoming technological reality (just as Star Trek foresaw so many other things, like mobile phones, voice recognition and tablet-based computers).

What we’re entering is what I call the direct-from-imagination era: you’ll speak entire worlds into existence.

This is core to my creative vision for what the metaverse is really about: not only a convergence of technologies, but a place for us to express our digital identities creatively.

This article explores the technological and business trends enabling this future.

You’d need a way to generate and compose ideas: “Computer, make me a fantasy world with elves and dragons… except made of Legos.”
You’d need a way to visualize the experience. Physics and realistic light simulation (ray tracing)
You’d need a way to have a persistent world with data, continuity, rules, systems.

Let’s look at a couple of the ways that generative artificial intelligence taps into your creativity:

ChatGPT as a Virtual Engine

There are a number of ways to conceptualize a large language model like ChatGPT; but one is that it is actually a virtual world engine. An example of this is how it can be used to dream of virtual machines and text adventure games.

From my article: “Creating a Text Adventure in ChatGPT”

Lensa and Self-Expression

Lensa grew to tens of millions of revenue in only a few weeks. It lets you imagine different versions of yourself and share it with your friends, gratifying our egos and our creativity.

Its enabling technology, Stable Diffusion, is disruptive because it dramatically reduces the cost of generating artwork, enabling new use cases like Lensa.

Lensa’s growth, with examples of its output based on the author.

Diagram of only a few of the steps in building a virtual world

The above diagram illustrates just a few of the steps involved in creating a virtual world: a game, an MMORPG, a simulation, a metaverse, or whatever term you prefer. There are actually far more iterative loops and revisitations to earlier phases throughout the process of building a large world, and a few types of content are left out (for example, audio).

A number of emerging technologies — not only generative AI, but advancements in compositional frameworks, computer graphics and parallel computation — are organizing, simplifying and eliminating formerly labor-intensive elements of the process. The impact of this is vast: not only accelerating production velocity and reducing costs, but enabling new use cases.

Before generative AI and “professional” platforms for worldbuilding became available, a number of other platforms existed. Let’s take a look at those:

Dungeons & Dragons

I’ve often called D&D the first metaverse: it’s an imaginative space with enough structure to allow collaborative storytelling and simulation. It had persistent, virtual worlds called campaigns.

For its first few decades, it was mostly non-digital. But more recently, tools have helped dematerialize the experience and make it more easy to conduct your campaign online. Generative AI tools have also helped dungeon masters create imagery to share with their groups.

Roll20 virtuable tabletop platform to enable D&D and other games online, along with a couple D&D-inspired graphics I made.

Minecraft: the Sandbox

Minecraft is not only a creative tapestry for individuals — it is a space of shared imagination where people compose vast worlds.

Screens here are taken from Divine Journey 2, a colossal modpack composed of many other mods and deployed on servers for players to experience together:

Roblox: the Walled Garden

Roblox is not a game — it is a multiverse of games, each created by members of its community.

Many of the most popular experiences of Roblox are not “games” in the traditional sense.

Many would not have gotten greenlit in the mainstream game publishing business — but in a shared space of creativity, new types of virtual worlds flourish.

Popular games on Roblox in January 2023. “Adopt Me!” is a good example of something that probably never would have been funded by a traditional game publisher.

3D Engines

A decade ago, if you wanted to build an immersive world in 3D, you’d need to know a lot about graphics APIs and matrix math.

For people who couldn’t realize their creativity in a sandbox or walled-garden — platforms like Unreal and Unity enable the creation of real-time, immersive worlds that simulate reality.

Persistent Worlds

3D engines provide a window into a world. But the memory of what happens in a world — the history, economy, social structure — as well as the rules that undergird a world, require a means of achieving consensus between all participants.

Walled-gardens like Roblox do this for you: but large-scale worlds have required the work of large engineering teams who build from scratch.

Compositional Frameworks will use generative AI to accelerate the worldbuilding process; begin with words, refine with words.

Physics-based methods such as ray tracing will simplify the creative process while delivering amazing experiences.

Generative AI will become part of the loop of games and online experiences, creating undreamt-of interactive forms.

Compute-on-demand will enable scalable, persistent worlds with whatever structure the creator imagines.

Computers can dream of worlds — and we can see into them — due to advances in parallel computing.

The next few sections will explain the exponential rise in computation — in your devices and in the cloud — driving the direct-from-imagination revolution, and then return to what the near-term future has in store.

Compute before 2020 is a rounding error vs. today

The top 500 supercomputing clusters in the world show us the exponential rise in computing power over the last few decades.

Top500 supercomputing clusters (Q4 2022), showing the total, largest and the smallest.

However, the top 500 only captures a small fraction of the overall compute that’s available. A few metrics to be aware of (Discussion):

Most 2022 phones had 2+ TFLOPs* of compute (2×10¹²) which is 100,000,000 faster than the computer that sent Apollo to the moon
The Frontier supercomputer passed 1.0 exaflops (10¹⁸)
1.5 exaflops on the “virtual supercomputer” that combined for the Folding@Home Covid 19 simulation.
Top500 Supercomputing clusters add up to ~10 exaflops
NVIDIA RTX-4090’s shipped at least 13 exaflops
Playstation 5’s combined surpasses 250 exaflops
Apple ships over 1 zettaflop (10²¹) of compute in 2022
Intel is working toward a zettaflop supercomputer

By 2027, hundreds of zettaflops seems plausible. By then, compute at the start of 2023 compute will seem like a rounding error again.

Technical note: in all these comps I blur single vs. double precision & matrix vs. vector ops, so it isn’t apples-to-apple. This will be a topic for a future post on global compute; meanwhile, this still ought to provide a rough order-of-magnitude.

Parallel Computation

Much of the increase in global computation capacity has occurred because of parallel computation. Within parallel computation there are two main types:

Programs that are especially parallel-compatible: this includes just about everything that benefits from matrix math, such as as graphics and artificial intelligence. This software benefits from adding lots of GPU cores (the cores themselves keep getting more specialized, like Tensor cores for AI, or raytracing cores for realtime physics-based rendering).
Programs that remain CPU-bound (more complicated programs with a lot of steps along the way), which benefit from multiple CPU cores.

*Or one of the several CUDA competitors
**Just for the visual example. In 2023, an NVIDIA RTX 4090 has a lot more. 16,384 cores!

Although having multiple CPU cores has helped us run multitasking programs more efficiently, most of the continuity of Moore’s Law is the result of the growth in GPUs.

However, AI models are growing a lot faster than GPU performance:

“On the Opportunities and Risks of Foundation Models,” Narayanan et al

Fortunately, cost per FLOP is decreasing at the same time:

Source: https://epochai.org/blog/trends-in-gpu-price-performance

Similarly, algorithms are getting much better. ImageNet training costs have decreased more than 95% over 4 years, and new AI are even discovering new and more efficient ways of running themselves:

Scaling Parallel Compute

For huge workloads — like training a huge AI model or running a persistent virtual world for millions of people — you have a couple main options:

Build an actual supercomputer (CPUs/CPUs all in one location, which needs high speed interconnects and shared memory spaces). Currently, this is needed for workloads like certain kinds of simulations or training large models like GPT-3.

Build a virtual supercomputer. Examples:

Folding@Home, an example of a distributed set of workloads which can be performed asynchronously and without shared memory. This approach is good for huge workloads when latency and shared memory don’t matter much. Folding@Home was able to simulate proteins for 0.1 seconds by distributing the workload across >1 exaflop of citizen-scientist computers on the internet.
Ethereum network — good for cryptographic and smart contract workloads
Put code into containers and orchestrate them over large CPU capacity using Kubernetes, Docker Swarm, Amazon ECS/EKS, etc.

Because of the combination of innovation in speed/density computing cores (mostly GPUs) alongside networking GPUs together into supercomputers, we’ve exponentially increased the amount of compute that’s available. This is illustrated in Gwern’s diagram of the power of the supercomputing clusters used to train the largest AI models created so far:

Scaling for Users

When a workload is more compute-bound, but can be broken down into separate containers (containing microservices, lambdas, etc.) you can use orchestration technologies like Kubernetes and Amazon ECS to rapidly deploy a large number of virtual machines to service demand. This is mostly useful for making software available for large number of users (rather than simply making software run faster). How many virtual machines? This chart gives you an idea of how quickly one can provision containers using state-of-the-art orchestration technologies in large datacenters:

Source: “Scaling Containers on AWS in 2022,” Vlad Ionescu, https://www.vladionescu.me/posts/scaling-containers-on-aws-in-2022

Vast worlds may be simulated on-device

It’s important to understand all aspects of this rise in compute. Cloud-based capacity and actual supercomputers are enabling training huge models and unifying applications that need to be accessed by millions of concurrent users.

However, the exponential rise in compute at the edge and in your own devices is just as important for building metaverses and holodecks. Here’s why: many things are simply done most-efficiently right in front of you. For one reason, there’s the speed of light: we’ll never be able to do things like generate real-time graphics as quickly in the cloud and ship it to you as we can on your device, not to mention that it is far more bandwidth-efficient to use the network to provide updates to geometry/vectors than it is to provide rasterized images. And many of the more interesting applications will need to perform local inference and localized graphics computation; cloud-based approaches will simply be too slow, cumbersome or violate privacy norms.

At the same time as our local hardware is getting better, the software is also improving at an exponential rate. This is illustrated in Unreal Engine, which has a few key features worth noting:

World partitioning allows open worlds to be stitched together
Nanite allows designers to create images of any geometry and place it in any world (cuts down on the optimization passes, constantly refining objects to lower polygons, etc., that comes up in realtime graphics systems).
Lumen is a global illumination system that uses ray tracing. It looks amazing, runs on consumer hardware, and spares developers from having to “bake” lighting before each build. The reason the latter is important is that most realtime lighting systems in use today (such as in games) require a time-consuming “baking” process to pre-calculate lighting in an environment before shipping the graphics to the user. This is not only a nuisance from a productivity standpoint, it also limits the extent of your creativity: dynamic global illumination means you can also have environments that change dynamically (e.g., allowing people to build their own structures within a virtual world).

Realtime ray tracing was a demo in 2018 that required a $60,000 PC. Now, it’s possible on a PlayStation 5. Technologies like Lumen, as well as more specialized GPUs such as that found in the NVIDIA RTX-4090, demonstrate how far both physics-based hardware and software have come in a short period of time.

Similarly, these improvements will not simply be a cloud-based realm of AI model-training and Web-based inference apps like ChatGPT. Hardware and algorithm improvements will make it possible to train your own models for your team, your game studio and even yourself; and on-device inference will unlock games, applications and virtual world experiences that were only dreams until recently.

A complete tour of generative AI would fill a bookshelf (or maybe even a whole library). I want to share a few examples of how generative technologies will replace some of the steps in the production process of building virtual worlds.

At this point, you’ve probably been bombarded with AI-generated art. But it’s worth a reminder of just how far use cases like concept-art generation have come in only a year:

Midjourney concept art for a “charming wizard”

3D Generative Art

One must distinguish between generative art that “looks like 3D” and art that is actually 3D. The former is simply another example of a generative 2D art; the latter uses a mesh geometry to render scenes with physics-based lighting systems. We’ll need that to build virtual worlds.

This is a domain that’s still in its infancy, but remember how quickly 2D developer. This is likely to improve dramatically in the near future. OpenAI has already demonstrated the ability to generate point clouds of 3D objects from a text prompt:

Neural Radiance Fields (NeRF)

Source: https://www.matthewtancik.com/nerf

NeRF generates 3D scenes and meshes generated from 2D images taken from small number of viewpoints. The simplest way to think about NeRF’s is that it is “inverse ray tracing,” where the 3D structure of a scene is learned from the way light falls on different cameras. Some of the applications include:

Make 3D creation accessible to photographers — more storytelling and virtual world content
An alternative to complicated photogrammetry

Beyond the immediate applications, reverse ray-tracing is a domain that will eventually help us generate accurate 3D models based on photos.

Text-to-NeRF

Natural language is becoming the unifying interface for many of the generative technologies, and NeRF is another example of that:

From: “DreamFusion: Text-to-3D using 2D Diffusion,” Poole, et al

AI Could Generate Entire Multiplayer Worlds

Text interfaces will also become a means of organizing larger-scale compositions. At Beamable, we made a proof-of-concept illustrating how you could use ChatGPT to generate the Unreal Engine Blueprints that would include the components necessary to build persistent virtual worlds:

AI can play sophisticated social games

In 2022, Meta AI showed that an AI (CICERO) could be trained on games recorded in a Web-based Diplomacy platform. This requires a combination of strategic reason as well as Natural Language Processing. This hints at a future with AI that will:

Help you work through longer, more-complex plans like composing an entire world
Participate “in-the-loop” of virtual experiences and games, acting as social collaborators and competitors

AI can learn and play compositional methods

In 2022, OpenAI demonstrated through a method called Video Pre-Training (VPT) that an AI could learn to play Minecraft.

This resulted in the ability to perform common gameplay behaviors — as well as compositional activities like building a base.

This further reinforces the idea of AI-based virtual beings that can populate worlds — as well as act as partners in the creative process.

AI Can Watch Videos to Make a Game

In a demo called GTA V: GAN Theft Auto, an AI was trained to watch videos of Grand Theft Auto. It learned to play the game, and from the learning process it was also able to generate a game based on what it saw. The result was a bit rough, but it’s still extremely compelling to imagine how this will improve over time.

Source: https://github.com/sentdex/GANTheftAuto/

Real-Time Compositional Frameworks

What happens when you combine the ability to do real-time ray tracing, generative AI within an online compositional framework? You should just watch this video of NVIDIA’s Omniverse platform for yourself: