Building the metaverse

Oct 28, 2024

There are a variety of views on what the metaverse is, who the audiences are and how to build it:

Matthew Ball’s work at https://www.matthewball.co/the-metaverse breaks the metaverse into networking, computing, virtual world engines, interoperability, hardware, payment rails and blockchains.

Jon Radoff at https://cognizium.io/uploads/resources/Jon%20Radoff%20-%20Building%20the%20Metaverse%20-%202022%20Feb.pdf sees a series of concentric rings: infrastructure at the center, then human interface hardware, decentralization using protocols, 3d rendering engines, authoring tools, personal agents and finally user experiences as a sum of those parts.

The World Economic Forum community have a definition at https://initiatives.weforum.org/defining-and-building-the-metaverse/home that focuses on governance and ‘economic and social value creation’.

https://www.buildingtheopenmetaverse.org/ takes a more industrial perspective, focusing on big data, digital twins and scalability; it is somewhat less focused on ease of use and more focused on raw capabilities or benefits.

I think we can break all these perspectives into roughly three buckets:

1) Executive or ‘platform’ vision:

Executives often talk about offering a general “platform”, or even a meta-platform (a platform for building platforms). At an executive level the language is general and aspirational, intended to motivate the troops.

But let’s unpack this a bit:

Developers here are given a mandate to provide a service that “lets anybody build anything”. More specifically non-technical people should be able to compose and share a rich media experience as easily as creating music, making a drawing or writing a poem.(*)

(*) This is in fact something that the software industry pursues, but as programmers we have yet to deliver. It just isn’t easy to build powerful experiences yet. We have seen some efforts, ranging from flash to the web, to authoring tools like Unity3D, but we still fall short of general rich media composability, let alone publishing at scale, especially where 3d is involved. Concerns involve limited device capabilities, hosting, authentication, ownership of digital assets and a wide range of other technical topics.

The term platform comes up quite a bit, and it is worth poking at. A “platform” is a collection of tools that work together to allow people to compose “applications” or “products”. Building a platform is seductive in that the conceit is that by abstracting out the commonalities one can service a wide range of products.

One way of thinking about a platform-play (as opposed to a single product) is to arrange the kinds of product that exist today as a spectrum or a “long tail” with popularity as one axis and volume as the other axis:

On the left are the most popular, visible products. Think AAA video games such as EA FIFA, built by teams in the hundreds (programmers, designers, artists, animators, producers) that require huge coordination efforts; the salaries alone can eat up tens of millions of dollars. There is significant quality assurance, bug fixing, polishing, R&D in general and middleware licensing. On top of that come licensing and partnerships for more tens of millions, and then marketing and promotion for more tens of millions again, which is in fact often the predominant cost: development is usually only 10–20% of the budget and the rest is marketing and licensing. Total production costs run to hundreds of millions of dollars, with corresponding rewards in the billion dollar range.
On the right are millions of small apps, web-apps, games, activations, even websites as a whole. These experiences are built by companies or individuals, each for their own reasons: sometimes as an ‘activation’ or marketing brand outreach, sometimes to explain something, sometimes for fun, often for money. They are distributed across app stores, native and on the web. Some may be for industrial uses, digital twins and narrower markets. They are lower risk and lower reward, and take less coordination energy to reason about and build.

The numbers on the left of the spectrum are eye-watering, but the total valuation and “volume” on the right as a whole is actually larger than the left edge. Strategically this can inform our work.

Overall a business goal of an executive is to service as much of the “spectrum” of rich experiences as possible — as easily as possible. But unless you have billions of dollars you’re simply not going to be outperforming Unreal or Unity or custom tool suites developed by thousands of engineers at EA — any more than a person with a camcorder is going to outperform James Cameron, Pixar and Universal Studios — so placing your bets wisely is important.

2) Product vision:

Product and marketing staff, or “business development” staff, are keenly aware of potential customer opportunities, and want to service those customers.

Product vision is actually different from the executive vision, and different from a technical vision, because it is solely focused on customer needs. Product doesn’t care what the technology is in fact. Product owns the customer and is beholden to the customer. Product sees the spectrum of possible experiences in that they are being pinged by customer requests every day.

In turn the “customer” (the brands, store-owners and merchants, or anybody that wants to deploy a rich experience) has a set of very specific goals. They are non-technical, so they want a no-code solution, or a white-glove solution. They want to be able to do fun, novel, compelling stuff: immersive 3d brand activations, virtual store-fronts, LLM-driven concierge-like docents and brand ambassadors. But they also, specifically, want to own their own data and their own customers.

Customers are not necessarily in mass market consumer retail. Some people want to build digital twins, simulations, do civic outreach to regional stakeholders — but consumer retail tends to have huge variety.

This part of the ecosystem is a rich froth with many stakeholders; there are brands that want product, there are consumers facing bewildering technical choices, there are marketers, resellers, tons of legal compliance issues and policing. And in fact success here often means testing against the market — building tools that let people play — and seeing where they respond. It’s just hard to anticipate and pre-plan.

From a technical perspective it is worth contrasting what product sellers and consumers want with where we are today. It’s extremely hard to “stand up” a rich media experience today, especially if there is multiplayer or 3d involved. You need to manage programmers, and custom build a bespoke world, and stage it, and deal with authentication and privacy and compliance issues.

Existing metaverse-like service providers such as Meta’s Horizon Worlds or even Roblox exist only as walled gardens with exorbitant fees and a lack of control. They are difficult to access universally on a variety of devices, and it is hard to log in or to switch between worlds with a persistent identity.

Generally we are still in a kind of “dark ages” of rich media services. The metaverse kind of sucks today frankly.

3) A technical vision:

Technically we as developers are also keenly aware of the “spectrum” of possible applications and we do want to service as much of that spectrum as possible. We do seek to build extensible frameworks out of reusable parts that anybody can use to assemble their own experiences.

People have in fact been focusing very hard on trying to solve interoperability; it could almost be said to be the main challenge of software development as a whole.

An example of a team that took a serious run at this is Improbable. They tried to put together the whole picture (public ledger based digital artifacts, networking at scale, immersive 3d and the like): https://en.wikipedia.org/wiki/Improbable_(company) . And if you look at Tim Sweeney and the work on Unreal you also see similar heroic efforts. Yet somehow these best-effort attempts have not succeeded in building a durable public platform. Where successes have occurred they seem to be at the ‘protocol’ level, such as the web itself.

What does it take to deliver core functionality for the metaverse?

We can (very roughly) break down a metaverse platform offering into a layer cake like so:

Core Engine

These days we tend to formalize the idea of an “engine” that “produces” the experience from some collection of parts, such as say from a manifest or a database.

The engine generally curates the user experience. Typical applications range from digital twins, such as modeling international shipping, or a brand activation, or a game.

IR-Engine [ https://github.com/ir-engine ] (which I contribute to) is an example of a very nice data-driven ECS (entity-component-system) architecture using React patterns. This allows for a separation between the core state of a system and the rendering or display of that system. But there are many other engines as well.

I think we’re all getting to an understanding that a core engine is itself really just a state manager: a micro-kernel, in that it doesn’t really have to think about 3d graphics as a core capability and is more concerned with loading and unloading agents and making sure they don’t stomp on each other. It looks increasingly like the core engine for a metaverse is similar to the core engine for Linux or other operating systems; the fundamental concerns are around state management and less around painting graphics.
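To make the "engine as state manager" idea concrete, here is a minimal sketch. It is not IR-Engine's API; the `World` class, its methods and the two systems are hypothetical names invented for illustration. The point is only that the world holds state, systems update it, and a renderer is just one more system that reads that state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class World:
    """Holds all state: entity id -> {component name -> data}."""
    entities: Dict[int, Dict[str, dict]] = field(default_factory=dict)
    systems: List[Callable[["World", float], None]] = field(default_factory=list)
    next_id: int = 0

    def spawn(self, **components: dict) -> int:
        entity_id = self.next_id
        self.next_id += 1
        self.entities[entity_id] = dict(components)
        return entity_id

    def query(self, *names: str):
        """Yield (id, components) for entities that have every named component."""
        for entity_id, comps in self.entities.items():
            if all(name in comps for name in names):
                yield entity_id, comps

    def tick(self, dt: float) -> None:
        for system in self.systems:
            system(self, dt)


def movement_system(world: World, dt: float) -> None:
    # Pure state update: knows nothing about graphics.
    for _, comps in world.query("position", "velocity"):
        comps["position"]["x"] += comps["velocity"]["x"] * dt


draw_log: List[str] = []


def text_render_system(world: World, dt: float) -> None:
    # Stand-in "renderer": reads state, never writes it.
    for entity_id, comps in world.query("position"):
        draw_log.append(f"entity {entity_id} at x={comps['position']['x']:.1f}")


world = World(systems=[movement_system, text_render_system])
ship = world.spawn(position={"x": 0.0}, velocity={"x": 2.0})
world.spawn(position={"x": 5.0})  # static entity, no velocity
world.tick(0.5)
```

Swapping `text_render_system` for a 3d renderer would not touch the world or the movement logic, which is the separation the paragraph above describes.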

Components

The kinds of things we want to turn into lego-like reusable components include images, sounds, 2d layout widgets, 3d objects, custom behaviors, LLMs and digital agents, shopping cart controls, motion capture capabilities, avatars, path finding and navigation, effects and lighting, user authentication, networking including WebRTC… In some cases we want to integrate extremely large GIS datasets, or satellite data. There’s a need for spatial querying and indexing as well.
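As one small illustration of spatial querying, here is a sketch of a uniform grid index over 2d positions. The `GridIndex` class and its method names are hypothetical, and a production system working with GIS-scale data would more likely reach for an R-tree or a geospatial database; this only shows the basic idea of bucketing items by cell so that a radius query does not scan everything.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class GridIndex:
    """Buckets items into square cells so nearby lookups touch few cells."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self._cells: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
        self._positions: Dict[str, Tuple[float, float]] = {}

    def _cell(self, x: float, y: float) -> Tuple[int, int]:
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, item_id: str, x: float, y: float) -> None:
        self._positions[item_id] = (x, y)
        self._cells[self._cell(x, y)].add(item_id)

    def within(self, x: float, y: float, radius: float) -> List[str]:
        """Return ids of items no further than `radius` from (x, y), sorted."""
        min_cx, min_cy = self._cell(x - radius, y - radius)
        max_cx, max_cy = self._cell(x + radius, y + radius)
        hits: List[str] = []
        for cx in range(min_cx, max_cx + 1):
            for cy in range(min_cy, max_cy + 1):
                # .get avoids creating empty cells while querying
                for item_id in self._cells.get((cx, cy), ()):
                    px, py = self._positions[item_id]
                    if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                        hits.append(item_id)
        return sorted(hits)


index = GridIndex(cell_size=4.0)
index.insert("lamp", 1.0, 1.0)
index.insert("couch", 2.0, 1.5)
index.insert("statue", 9.0, 9.0)
nearby = index.within(0.0, 0.0, radius=3.0)
```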

There are some challenges with this — organizing assets in a human comprehensible way, dealing with compilation of assets, provenance and rights issues in some cases, standardization of formats and protocols.

In particular one challenge is organizing dynamic behaviors, and even having public registries of those behaviors. This itself is a fairly ambitious goal, but it does reflect general trends in computation. We do see a lot of formalism around packaging (such as npm registries).

I feel like we’re getting closer as far as component architectures go; we have ways of late loading resources, scheduling and marshaling them. Arguably what is needed is some kind of public registry, although people are starting to use NPM itself as a registry not just for Node.js packages but also for a wider variety of small code blobs. There’s also a need for strong security permissions for multi-participant scenarios.
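A rough sketch of what a registry with late loading might look like, reduced to an in-process toy. The `ComponentRegistry` class and the `example.org/avatar` name are made up for illustration; a real public registry would also need versioning, signing and network fetches, none of which are shown here.

```python
from typing import Any, Callable, Dict


class ComponentRegistry:
    """Maps a public name to a loader; the loader runs only on first use."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        if name in self._loaders:
            raise ValueError(f"component already registered: {name}")
        self._loaders[name] = loader

    def resolve(self, name: str) -> Any:
        if name not in self._cache:
            if name not in self._loaders:
                raise KeyError(f"unknown component: {name}")
            self._cache[name] = self._loaders[name]()  # late load happens here
        return self._cache[name]


load_count = {"avatar": 0}


def load_avatar() -> dict:
    # Stand-in for an expensive fetch or compile step.
    load_count["avatar"] += 1
    return {"kind": "avatar", "mesh": "placeholder.glb"}


registry = ComponentRegistry()
registry.register("example.org/avatar", load_avatar)
first = registry.resolve("example.org/avatar")
second = registry.resolve("example.org/avatar")  # served from cache
```

Nothing is loaded at registration time, so a scene can cite thousands of components and only pay for the ones it actually touches.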

Another challenge is durable dynamic behavior. Today we live in an internet that is largely focused on static artifacts, or artifacts that only are responsive while you are focused on them. Even tools like ESRI tend to operate on static data.

But the metaverse will soon consist of durable behaviors written by many different people co-existing in group sandboxes, interacting with each other, advocating on our behalf, and running even when we are not looking at them.

We will want to get to a point where we can readily load millions of autonomous digital agents into a shared computational soup, dynamically, on the fly, with custom behaviors, and strong permissions boundaries around each agent — where the “sandbox” never comes down and is never rebooted, but rather the agents are updated dynamically in a persistent durable always on metaverse.
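The shape of that idea can be sketched in a few lines, with heavy caveats. The `Sandbox`, `AgentContext` and `PermissionDenied` names are hypothetical, and plain Python objects do not provide real isolation; an actual system would need process, VM or WebAssembly boundaries. The sketch only models two properties: writes are checked against per-agent grants, and an agent's behavior can be replaced while the host keeps running.

```python
from typing import Callable, Dict, Set


class PermissionDenied(Exception):
    pass


class Sandbox:
    """Long-lived host: agents are added and replaced without a restart."""

    def __init__(self) -> None:
        self.shared_state: Dict[str, int] = {}
        self._agents: Dict[str, Callable[["AgentContext"], None]] = {}
        self._grants: Dict[str, Set[str]] = {}

    def install(self, agent_id: str,
                behavior: Callable[["AgentContext"], None],
                writable_keys: Set[str]) -> None:
        # Installing under an existing id swaps the behavior in place.
        self._agents[agent_id] = behavior
        self._grants[agent_id] = set(writable_keys)

    def may_write(self, agent_id: str, key: str) -> bool:
        return key in self._grants.get(agent_id, set())

    def step(self) -> None:
        for agent_id, behavior in list(self._agents.items()):
            behavior(AgentContext(self, agent_id))


class AgentContext:
    """The only handle an agent gets; writes are checked against its grants."""

    def __init__(self, sandbox: Sandbox, agent_id: str) -> None:
        self._sandbox = sandbox
        self._agent_id = agent_id

    def read(self, key: str) -> int:
        return self._sandbox.shared_state.get(key, 0)

    def write(self, key: str, value: int) -> None:
        if not self._sandbox.may_write(self._agent_id, key):
            raise PermissionDenied(f"{self._agent_id} may not write {key}")
        self._sandbox.shared_state[key] = value


sandbox = Sandbox()
sandbox.install("greeter",
                lambda ctx: ctx.write("greetings", ctx.read("greetings") + 1),
                {"greetings"})
sandbox.step()
# Hot update: same agent id, new behavior, sandbox state is preserved.
sandbox.install("greeter",
                lambda ctx: ctx.write("greetings", ctx.read("greetings") + 10),
                {"greetings"})
sandbox.step()
```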

Manifests

We typically need some kind of way of organizing the assets of an app in a human comprehensible way and then fetching them in an organized way.

A rich media experience can be described from a starting manifest that cites potentially thousands of dependencies — all of the 2d and 3d elements, pages, triggers, art, audio, video. Effectively at the core of any experience is a “database” of assets, all marked up in a way that they can be fetched on demand as needed.

Often the filesystem is used but sometimes fancy CMS systems or 3d world authoring tools sit on top of the filesystem and take on this organizational role. Sometimes a simple text file with a custom “scenario definition language” describes what assets to load for a given scene.

In general there are some maxims for a manifest grammar. Manifests that describe collections of digital assets must be in an application neutral text format, they must be easy for a user to edit in a text editor, able to be reconciled with merges in git, and must be capable of storing arbitrary state.

While manifests can be inhaled into a database, a manifest fundamentally is a “bridge format” for human consumption. It is a fatal flaw to use a binary blob for a manifest, or a gigantic text document that is too large for a human to search through or digest, or to only use a database and have no transport grammar. Manifests are at a ‘protocol level’ as well in that they are public, formal, and any namespaces in them should have public registries.
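As an illustration of those maxims, here is a small sketch of a text manifest and a loader that works out a fetch order from it. The JSON layout (`assets`, `needs`, `state`) is an invented format for this example, not an existing standard; any line-oriented text grammar that merges cleanly in git would serve the same purpose.

```python
import json
from typing import Dict, List, Set

# One asset per line keeps diffs and merges readable.
MANIFEST_TEXT = """
{
  "name": "example-storefront",
  "assets": {
    "scene":   {"type": "scene",   "uri": "scene.json",  "needs": ["couch", "music"]},
    "couch":   {"type": "model",   "uri": "couch.glb",   "needs": ["leather"]},
    "leather": {"type": "texture", "uri": "leather.png", "needs": []},
    "music":   {"type": "audio",   "uri": "loop.ogg",    "needs": []}
  },
  "state": {"couch": {"color": "red"}}
}
"""


def load_order(manifest: dict, root: str) -> List[str]:
    """Return asset ids so that every dependency precedes its dependent."""
    assets: Dict[str, dict] = manifest["assets"]
    ordered: List[str] = []
    visiting: Set[str] = set()

    def visit(asset_id: str) -> None:
        if asset_id in ordered:
            return
        if asset_id in visiting:
            raise ValueError(f"dependency cycle at {asset_id}")
        if asset_id not in assets:
            raise KeyError(f"manifest cites unknown asset: {asset_id}")
        visiting.add(asset_id)
        for dep in assets[asset_id]["needs"]:
            visit(dep)
        visiting.discard(asset_id)
        ordered.append(asset_id)

    visit(root)
    return ordered


manifest = json.loads(MANIFEST_TEXT)
order = load_order(manifest, "scene")
```

The `state` block shows the "arbitrary state" requirement: it rides along in the same text file and needs no schema beyond what the consuming component expects.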

Authoring

I tend to think that 3d world editors are not as necessary as people might think. Only the simplest of toy scenes can be authored in these tools. They are useful, but usually insufficient.

I also tend to see interactive 3d editors as a surrogate for manifest formats that are too complex, or where there are not other ways to parametrically express truly complex layouts, the kinds of layouts that typify real world use cases.

Building worlds “at scale” should be about as easy as describing them in English; the grammar we use should be simple, accessible, composable and resistant to errors. These should also be multiplayer experiences that are active all the time, with no separation between play and construction.

I don’t think we are there yet in terms of an editable metaverse, but it feels like we will get there soon. One observation here is an idea of “intent based design” or “semantic intent”. Ordinary people want to say “put a nice couch against the wall; make it leather, make it red” and “put a clock on the wall; make it an analog clock”. They don’t want to scour a marketplace for assets, have to pay for that asset, place the asset at an xyz location, learn how to rotate, scale, and learn 3d concepts. They want to supply a photograph and say “make it look like that”.
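One way to picture "semantic intent" is as a small structured record that a resolver turns into concrete coordinates. The sketch below assumes the natural-language step has already happened (by an LLM or a parser, not shown) and that it produced an `Intent`; the `Intent`, `Room` and `resolve` names, the anchor vocabulary and the rotation convention are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Intent:
    """Structured form of 'put a couch against the wall; make it red leather'."""
    item: str
    anchor: str  # e.g. "north_wall"
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Room:
    width: float  # metres along x
    depth: float  # metres along z


def resolve(intent: Intent, room: Room, item_depth: float) -> dict:
    """Turn an intent into a placement so the user never types xyz values."""
    half = item_depth / 2
    anchors: Dict[str, Tuple[float, float, float]] = {
        # (x, z, rotation in degrees): centred on the wall, facing into the room
        "north_wall": (room.width / 2, half, 180.0),
        "south_wall": (room.width / 2, room.depth - half, 0.0),
    }
    if intent.anchor not in anchors:
        raise ValueError(f"unknown anchor: {intent.anchor}")
    x, z, rotation = anchors[intent.anchor]
    return {"item": intent.item, "x": x, "z": z, "rotation": rotation,
            **intent.attributes}


couch = Intent("couch", "north_wall", {"material": "leather", "color": "red"})
placement = resolve(couch, Room(width=4.0, depth=3.0), item_depth=1.0)
```

The user-facing vocabulary stays at the level of "couch", "wall", "leather" and "red"; picking the asset and computing the transform is the system's job.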

Publishing

Beyond this we want to stage and deploy an experience in a single click at most. And we want a durable experience that will never fall over. If we’re falling short of an editable metaverse, or we want to stand up a private experience (such as a brand activation) then the right pattern is basically to follow what is being done for the web today as a whole — where we have extremely mature deployment pipelines for web apps.

One challenge for metaverse apps is that they occasionally need a ‘center’ — they need to run behavior on a server — such as for a multiplayer experience. In a ‘long tail’ of user experiences probably about 10% or 20% of experiences have this requirement.

In this scenario it makes sense to offer standalone process isolated docker containers for each server or ‘fragment’ of the metaverse — and to find ways to durably persist user identity across those boundaries at a protocol level such as by using public ledgers.

One complaint that store-owners have about centralized or walled garden solutions (such as Meta’s Horizon Worlds or Roblox) is that they don’t own their data or their customers. They can’t always get clear foot-traffic analytics of who visited their store but didn’t make a purchase, and they can’t reach out to customers easily. Much of this is due to legitimate privacy concerns, but there’s also a tendency for walled gardens to hold data hostage. So it is important when offering a metaverse server product to allow people to stage and host their own instances easily.

Of course delivering experiences at scale can require servers that load-balance requests around the world, moving computation closer to customers. For multiplayer experiences there are also latency concerns, and sometimes geographic segmentation or zoning of traffic. Not all challenges here are solved, but for the long tail of consumer experiences there are solutions.

Identity

Managing trust can be challenging. Often there needs to be a hand-off between different tiers of relationships. For example an experience may be a 3d immersive shopping cart that then has to transition to a third-party hosted checkout service (such as Shopify). This requires managing session keys in a secure way, and remembering users when they return, or when there are changes across disjoint databases.
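A hand-off like that can be sketched as a short-lived signed token. This example uses a shared-secret HMAC from the Python standard library, which is a stand-in chosen for brevity and not the public-ledger approach discussed elsewhere in this essay, and it is not how any particular checkout provider works. The function names, the token layout and the secret are all assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SHARED_SECRET = b"demo-secret-do-not-use"  # assumption: agreed out of band


def issue_handoff(user_id: str, cart_id: str, ttl_seconds: int,
                  now: Optional[float] = None) -> str:
    """Sign a small claim set that the receiving service can verify."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"user": user_id, "cart": cart_id, "exp": now + ttl_seconds},
        sort_keys=True,
    ).encode()
    signature = hmac.new(SHARED_SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(signature).decode())


def verify_handoff(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(SHARED_SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time compare
        return None
    claims = json.loads(payload)
    return claims if claims["exp"] > now else None


token = issue_handoff("user-1", "cart-9", ttl_seconds=60, now=1000.0)
claims = verify_handoff(token, now=1010.0)   # inside the 60 second window
expired = verify_handoff(token, now=2000.0)  # past expiry, so None
```

The short expiry limits the damage if a token leaks during the redirect, and the signature means the checkout side does not need to call back into the 3d experience to trust the cart id.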

An extension of trust is social trust graphs for filtering bad actors as well.

Today authentication should use public ledgers if it wants to be operating at a protocol level, and to be truly part of a durable public commons. It is reasonable to offer hosted authentication but users should be able to export their ‘wallets’ and own their data.

As much of this should be at a protocol level as possible of course.

But does a technical laundry list deliver the Metaverse?

Let’s back up and review the Metaverse itself. I didn’t want to bother with this earlier because we are all technical and we all understand this. But I want to be a bit specific because later in this essay I want to close on a broader vision of how we all succeed in delivering this vision.

The term metaverse in a stricter sense refers to an idea of a shared virtual space with 3d avatars. Often it is painted as a kind of William Gibson Sci-Fi ‘cyberspace’ inhabited by surrealistic and futuristic representations of the artifacts, concepts, people and other entities. A key concept is that this is a single shared space that participants can move seamlessly within and build within.

If we ask ChatGPT for an abbreviated list of concerns we see topics such as ‘distributed back end’, ‘load balancing’, ‘spatial partitioning’, ‘persistent data’, ‘versioning of assets’, ‘real time networking’, ‘physics engine’, ‘synchronization of state’, ‘state prediction’, ‘content creation and content moderation’, ‘customizability’, ‘interoperability of standards’, ‘privacy’, ‘security’, ‘economy and asset management’, ‘in world assets’, ‘ux design’, ‘accessibility’, ‘legal and ethical concerns’, ‘ip rights’…

But even this isn’t a full list; in fact the way we use the term ‘metaverse’ is more broad than a 3d shared space. For our purposes the metaverse brings together most if not all of our existing media-types. What we think of as separate industries (story-telling, movies, video-games and social media networks) are all in fact aspects of a broader emerging media type; a kind of multimedia, or rich-media compositional framework that is more disjoint, with separate worlds for separate purposes.

One of my favorite recent projects I’ve seen is a multi-player elder-care physiotherapy app that uses motion-capture and VR in a therapeutic setting. I think this is an excellent definition of a ‘place’ or metaverse — even though only the physiotherapist and the patient inhabit it.

By such a definition we already inhabit a ‘metaverse’, in the sense that we exchange rich media products every day; it is just not fully mature yet. Timothy Morton writes about a similar concept in the book “Hyperobjects”, observing that there are entities so vast, operating on such large time scales, that we don’t fully perceive them.

Igniting the Metaverse

Software considerations are just part of the puzzle. There’s a deeper question of how do we move into a metaverse. Even if we have a perfectly executing piece of architecture, a perfect platform, that anybody can in fact use to build anything — how do we attract, support and ‘ignite’ that community?

If we ask ChatGPT about reasons the metaverse could fail it reports ‘a lack of compelling content’, ‘weak community and cultural fit’, ‘economic model misalignment’, ‘insufficient marketing’, ‘limited interoperability’, ‘limited accessibility’, ‘regulatory pushback’, ‘failure to build trust’, ‘lack of real world utility or relevance’.

And contrariwise, reasons for success could be ‘partnerships with artists, brands and creators to supply continuous content’, ‘inclusive and welcoming culture’, ‘transparent and fair economic system’, ‘cross platform support’, ‘responsive moderation and governance’, ‘clear vision’, ‘real world utility and relevance’, ‘rapid iteration’, ‘trustworthy leadership’.

Very few of these criteria are captured in a technical laundry list.

I think it comes down to rewarding risk and experimentation — I would argue that developers are very good at delivering on well defined goals if they are empowered to do so, and if the goals are clearly stated.

Individual companies probably need an internal formalization of a process for picking winners — an internal venture wing, or an internal product approval process, where a project proposal is written up, with a clear description of labor, risk, financial rewards, and then projects are voted on, funded or not funded, and resourced or not. Many companies are too focused on either technology, or on servicing customers, and have extremely muddy internal processes for innovation, or productization or public exercises of internal ideas, even though often it is the internal idea pool that has inspired the work in the first place; these companies need to refine their pipelines for how they approve ideas and fund risk in order to not miss market opportunity. Acquisitions are one way to bring innovation in house, but it may also make sense to fund spin-offs as well.

At a public level the venture community has a fairly well defined process for submission of proposals, and often work can be funded in that way. I personally find the venture community to be highly rewarding and very clear in goals and outcomes.

There are also companies that want to offer metaverse based experiences, and it is possible to simply service those companies — to build for hire. This is also another excellent way to build the metaverse.

Another technique that works to varying degrees is hackathons, public prizes and rewards such as X prizes, or even retroactive rewards. There are many public interest communities in the metaverse space (such as the AWE community) that can take a strong role in the stewardship here.

Effectively if we succeed we will have built a kind of piano, or a kind of instrument, and we have to teach people how to play, and the best way is to play together, to have many people experimenting and trying ideas, in public, at low risk, so that we can collectively create this future.