Why avatars are the future of videoconferencing

A speculative look at the future of telepresence, and how the introduction of avatars can give us improvements in latency, image quality, eye contact and more.

Introduction: Videoconferencing has itches that need scratching

I was recently given a tour of a facility that makes high-quality videoconferencing systems. It’s a healthy business with an innovative culture and a sensible product roadmap. However, all suppliers experience the same quality issues, which leave their services wanting. My takeaway is that videoconferencing would benefit from an added layer of intelligence. Videoconferencing needs avatars.

The itches

Before exploring how avatars can help videoconferencing, let’s look at today’s issues, which persist across all platforms and set-ups.

  • Latency & bandwidth demand. Latency is the delay that can occur when talking across great distances. For landline voice calls, it’s not a problem. But if you switch to voice by satellite, e.g. when we Norwegians talk to friends on offshore oil installations, the delay makes the conversation awkward. In videoconferencing, latency has two sources: compression time and transfer time. Compression latency is becoming negligible thanks to modern graphics processors. However, videoconferencing hogs bandwidth, and therefore, even as networks get faster, transfer time will remain an issue unless we get smarter at addressing it. For mobile devices, even one-to-many conversations (multicast) are difficult on reasonably modern (3G) connections.
  • Low resolution and compression artefacts. Even at high-end facilities, on-premise conversations sometimes show artefacts, breaking the illusion of “telepresence”.
  • Lack of eye contact, which occurs when the camera is placed noticeably away from the other person’s eyes. This is an issue for virtually all systems, from high end setups to iPhones.
  • Lack of spatial prioritisation of humans. Today, if more than two people are participating in a videoconference, a 4:3 or 16:9 screen has trouble presenting them in a meaningful manner. Videoconferencing suppliers may protest, saying they prioritise based on who is speaking. But as we will see, a more refined spatial arrangement is possible.
  • Other video quality issues, such as poor brightness control.
  • Storage issues. High quality video cannot be stored in abundance on portable media. And finally:
  • Cooperation difficulties. Videoconferencing works quite well for meeting other people, but not for actually working together. After all, people are not in the same room, yielding difficulties in working on the same set of documents.
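To make the bandwidth itch concrete, here is a rough back-of-envelope sketch. All figures (resolution, compression ratio, uplink speed) are illustrative assumptions of mine, not measurements from any real system:

```python
# Back-of-envelope bandwidth estimate for videoconferencing.
# All figures are illustrative assumptions, not measurements.

def raw_bitrate(width, height, fps, bits_per_pixel=24):
    """Uncompressed video bitrate in megabits per second."""
    return width * height * fps * bits_per_pixel / 1e6

def compressed_bitrate(width, height, fps, compression_ratio=200):
    """Assume a lossy codec achieves roughly 200:1 compression."""
    return raw_bitrate(width, height, fps) / compression_ratio

raw = raw_bitrate(1280, 720, 30)                # ~664 Mbit/s uncompressed
one_stream = compressed_bitrate(1280, 720, 30)  # ~3.3 Mbit/s compressed

# A one-to-many (multicast) call with four recipients, naively sent
# as separate unicast streams, multiplies the uplink demand:
four_recipients = 4 * one_stream

print(f"raw:        {raw:.0f} Mbit/s")
print(f"compressed: {one_stream:.1f} Mbit/s per stream")
print(f"4-way call: {four_recipients:.1f} Mbit/s uplink")
# A typical 3G uplink tops out around 1-2 Mbit/s, so even a single
# compressed stream is a tight fit, and multicast is out of reach.
```

Even granting generous compression, the per-stream cost dwarfs a mobile uplink — which is why shrinking what must be transferred per frame matters so much.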

How to scratch: Introduce avatars in the codec

All the issues mentioned above can be addressed by introducing avatars in the codec.

The avatars are not at all supposed to be perceived as such by the users of the system. Rather, they are meant to visualize humans in a way that appears more natural to other people, while requiring less information to be transferred between endpoints. The system may also use the avatars to rearrange the scene. Let’s explore the mechanisms before we see how this is beneficial.


Today, video streams are compressed similarly to still images. The compression is raster-based and lossy, like JPEG, rather than vector-based and lossless, like SWF. Video streams are also compressed along the time axis, except when a complete picture, a keyframe, is introduced.
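The temporal side of this can be sketched with a toy encoder, purely illustrative: a full keyframe is sent once, and subsequent frames carry only the pixels that changed. When little moves, the deltas stay sparse and cheap:

```python
# Toy temporal compression: send a keyframe, then only the pixels
# that changed. Frames are tiny lists of ints standing in for pixels.

def encode(frames):
    """Yield ('key', frame) once, then ('delta', {index: new_value}) per frame."""
    keyframe = frames[0]
    yield ("key", list(keyframe))
    previous = list(keyframe)
    for frame in frames[1:]:
        delta = {i: v for i, (v, old) in enumerate(zip(frame, previous)) if v != old}
        yield ("delta", delta)
        previous = list(frame)

def decode(stream):
    """Rebuild the frame sequence from the keyframe and the deltas."""
    frames = []
    current = None
    for kind, payload in stream:
        if kind == "key":
            current = list(payload)
        else:
            current = list(current)
            for i, v in payload.items():
                current[i] = v
        frames.append(current)
    return frames

original = [
    [10, 10, 10, 10],   # keyframe
    [10, 10, 99, 10],   # one pixel changed -> delta has one entry
    [10, 10, 99, 10],   # nothing changed   -> empty delta
]
assert decode(encode(original)) == original
```

Real codecs are far more sophisticated (motion compensation, block transforms), but the principle — transmit only what changed since the reference — is the same one the avatar layer will exploit.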

One modern format in use by state-of-the-art systems is H.265. H.265 has no understanding of the human form. Therefore, it starts from scratch every time it has to relay visual information about a person. Metaphorically, if the H.265 compression algorithm were a person, you could have a conversation with it about colors and patterns, but every time you wanted to talk to it about someone, you would have to begin with a basic description of human beings in general. Of course, if the algorithm were able to remember what a human is, and even tell individuals apart, this would simplify the conversation. This simplification, and the benefits it hopefully yields, is what this blog post is all about.

To teach videoconferencing systems about humans, avatars should be introduced as an added layer in the codec. The avatars are used in conjunction with the existing video compression.

An avatar would be similar to SWF in that it is partly vector-based and lossless (by use of a 3D polygon mesh representing the human). In addition to the polygon mesh, texture maps (diffuse, specular, bump) and possibly lighting approximations are incorporated. A library of expressions should also be available.
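As a sketch of what such a per-person avatar asset might contain — field names here are my own invention, not a proposed standard:

```python
# Illustrative data layout for a per-person avatar asset.
# Field names are invented for this sketch, not from any standard.
from dataclasses import dataclass, field

@dataclass
class Avatar:
    person_id: str
    mesh_vertices: list          # 3D polygon mesh: [(x, y, z), ...]
    mesh_faces: list             # vertex-index triples
    diffuse_map: bytes           # texture maps, captured once per person
    specular_map: bytes
    bump_map: bytes
    expressions: dict = field(default_factory=dict)  # name -> vertex offsets

emily = Avatar(
    person_id="emily",
    mesh_vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    mesh_faces=[(0, 1, 2)],
    diffuse_map=b"...",          # would be real texture data
    specular_map=b"...",
    bump_map=b"...",
    expressions={"smile": [(0.0, 0.1, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.0)]},
)

# The heavy assets above are captured and transferred once;
# per-frame traffic only needs to reference them.
```

The design choice worth noting: everything expensive (mesh, textures) lives in the library and crosses the network once, which is what makes the per-call traffic small.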

The polygon mesh & texture maps, and possibly the set of expressions, should be tailored to the individual users. In a corporate environment, one might envision sending the employees to a 3D photographer, thus building a visual library of the people likely to join videoconferences.

When entering a videoconference, the system performs face and posture recognition on the users. It builds skeletal representations of the human participants, similar to what Microsoft’s Kinect does, but with more emphasis on the face, including expressions.
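The appeal of this capture step is how little data a frame’s worth of avatar parameters takes. A sketch — the particular parameter set (head pose plus a handful of expression weights) is my guess at what such a system might track, not taken from any real product:

```python
# Per-frame avatar parameters: head pose plus a few expression
# weights, packed as 32-bit floats. Parameter set is illustrative.
import struct

def pack_frame(head_pose, expression_weights):
    """Pack pose (x, y, z, yaw, pitch, roll) and N expression weights."""
    values = list(head_pose) + list(expression_weights)
    return struct.pack(f"<{len(values)}f", *values)

pose = (0.1, 0.0, 1.5, 0.02, -0.01, 0.0)   # position + orientation
weights = [0.8, 0.0, 0.1, 0.0, 0.05]       # e.g. smile, blink, ...

frame = pack_frame(pose, weights)
print(len(frame), "bytes per frame")       # 11 floats -> 44 bytes

# Compare with raw pixels: one 1280x720 RGB frame is ~2.7 MB.
# Even at 30 fps, these avatar parameters amount to ~1.3 kB/s.
```

If the rendering on the far end is good, those few dozen bytes per frame replace megabytes of pixels — this is the core of the bandwidth argument.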

Having done this, the system renders the avatars in real time. If the rendering is sufficiently realistic, only the avatar information is needed to visualize the human on the recipient’s end. A relevant demonstration, the Digital Emily Project, was made in 2008, albeit in a post-processing (as opposed to real-time) environment.

On the other hand, if the rendering is slightly unrealistic, people will feel uncomfortable about it. It will look like a robot pretending to be human, Max Headroom style. Such robots fall into the undesirable Uncanny Valley.

If this is the case, the system can calculate the difference (or Δ, delta) between the (unrealistic) avatar rendering and the video image of the user. The avatar information and the delta stream, encoded e.g. in H.265, are transferred across the network to the recipient, where they are joined to recreate the video image.
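The delta mechanism itself is simple to sketch. Frames here are tiny grids of grayscale values standing in for real images; an actual system would operate on full frames and run the delta through a video codec:

```python
# Delta stream sketch: sender subtracts the avatar rendering from the
# camera image; receiver adds it back to its own identical rendering.
# Frames are tiny grayscale grids standing in for real images.

def delta(camera, rendering):
    """Per-pixel difference between camera frame and avatar render."""
    return [[c - r for c, r in zip(crow, rrow)]
            for crow, rrow in zip(camera, rendering)]

def reconstruct(rendering, d):
    """Receiver side: avatar rendering + delta = original camera frame."""
    return [[r + e for r, e in zip(rrow, drow)]
            for rrow, drow in zip(rendering, d)]

camera    = [[120, 121], [118, 255]]   # what the camera sees
rendering = [[119, 121], [118, 200]]   # what the avatar render predicts

d = delta(camera, rendering)           # mostly zeros where the render is good
assert reconstruct(rendering, d) == camera
# The closer the rendering matches reality, the more zeros in d,
# and the cheaper the delta stream is to compress.
```

Note the prerequisite this exposes: sender and receiver must produce bit-identical avatar renderings from the same parameters, or the reconstruction drifts.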

In a very crude experiment (below), we compare the image quality and file size of a JPEG image and an avatar-based codec. The file size for the avatar-based codec is given for the delta file/stream only. I have no way of calculating the file size of the avatar part of the codec (initial face mesh + texture + position + movement vector); this part has to be small, relatively speaking, for the concept to have value.

“Original” and “avatar” images are courtesy of the Digital Emily Project.

Observations when studying these images:

  • Fundamental to the rationale for this process is that the avatar information and the delta stream combined are significantly smaller in size and/or have significantly higher visual quality than the corresponding non-avatar video codec.
  • In the image of the simulated avatar-based codec (bottom right), the most annoying compression artefacts are around the eyes, nostrils and mouth. These areas were not given extra priority in this experiment. Since the system is anatomy-aware, this can be improved. Note that an anatomy-aware system could apply the same prioritisation to a normal video stream.
  • The “photo” of the woman is, in fact, a rendering (for this experiment, I could not find a suitable pair of photo/render comparison images). Also, the “avatar” rendering uses just a low-quality “diffuse” technique. This makes the experiment unrealistic in a number of ways:
  • A rendering can be much more realistic than the “avatar” image used here, giving a higher yield.
  • In this experiment, the “original” and “avatar” images share a polygon mesh, so the head alignment, facial features and expression match perfectly. This is highly unrealistic for a system that matches an avatar to a video stream in real time. In a real setting, these factors will therefore contribute to a lower yield. In fact, if the avatar’s features are misaligned by more than a couple of pixels, the effort to correct this may be so costly (in terms of the delta stream’s size) that the benefits of the avatar are outweighed.
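The misalignment point can be illustrated numerically. In this toy measurement (a synthetic 1-D edge standing in for a face contour, my own construction), shifting the rendering even a couple of pixels against the camera image inflates the delta sharply:

```python
# How misalignment inflates the delta stream. A 1-D "image" with a
# sharp edge stands in for a face contour; shifting the rendering by
# k pixels against it makes the per-pixel delta grow quickly.

def mean_abs_delta(camera, rendering):
    return sum(abs(c - r) for c, r in zip(camera, rendering)) / len(camera)

edge = [0] * 10 + [200] * 10          # dark-to-bright edge, 20 "pixels"

def shifted(img, k):
    """Shift right by k pixels, padding with the first value."""
    return [img[0]] * k + img[:len(img) - k]

for k in range(4):
    print(k, mean_abs_delta(edge, shifted(edge, k)))
# k=0 -> 0.0  (perfect alignment: the delta is free)
# k=1 -> 10.0, k=2 -> 20.0: each pixel of misalignment adds a band
# of maximum-magnitude delta along the edge.
```

Around every sharp contour, the delta cost grows roughly linearly with the misalignment — which is why sub-pixel tracking accuracy would be essential for the scheme to pay off.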

Secondary mechanics:

  • In general, prioritize crucial human elements (mouth, eyes, face region). Treat eyes and mouth with higher resolution and less compression than other face features, and have a higher frame rate for the face in total than for other parts of the video.
  • Virtually move the camera for eye-to-eye contact.
  • Adjust brightness and contrast to the face portion of the image.
  • Place people closer together in the picture. E.g., in the Microsoft/Polycom RoundTable system, the video strip showing the people who are not talking is nearly unusable, because it mostly shows the office wall. If empty sections are cropped away, spatial usefulness improves vastly for office spaces not specifically tailored for videoconferencing.
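The prioritisation rules above could be expressed as a simple bit-budget allocation over regions. A sketch — the region names and priority weights are mine, purely illustrative:

```python
# Region-of-interest bit allocation: spend the frame's bit budget
# unevenly, favouring eyes and mouth. Weights are illustrative.

REGION_WEIGHTS = {
    "eyes":       8.0,
    "mouth":      8.0,
    "face":       4.0,
    "body":       2.0,
    "background": 1.0,
}

def allocate_bits(total_bits, region_areas):
    """Split total_bits across regions by area x priority weight."""
    scores = {name: area * REGION_WEIGHTS[name]
              for name, area in region_areas.items()}
    total_score = sum(scores.values())
    return {name: round(total_bits * s / total_score)
            for name, s in scores.items()}

# Areas as fractions of the frame (detected by the anatomy-aware layer):
areas = {"eyes": 0.01, "mouth": 0.01, "face": 0.08,
         "body": 0.30, "background": 0.60}
budget = allocate_bits(1_000_000, areas)
print(budget)
# Eyes and mouth get 8x the bits per unit area of the background;
# the background, despite covering 60% of the frame, receives well
# under half the budget.
```

The same allocation function could drive frame-rate decisions too (update the face region every frame, the background less often).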

For the first versions of the system, the benefit/yield may be modest, or even nonexistent. As refinements are implemented in image & motion capture and in the avatar codec, the optional delta stream should contain less and less information.

In situations where the avatar representation fails, the existing video codec is still present, and so this layer can deliver the same functionality as today.

Mobile environment

If I get time, I’ll add a section where I discuss whether this is feasible on smartphones etc. E.g., if the uncanny valley is bridged, an avatar may upscale the mobile video from low to high resolution based on information in the visual library.

Cooperation opportunities

If I get time, I’ll add a section where I discuss e.g. whiteboarding with avatars.

Conclusion: Benefits

To conclude, assuming that an avatar representation can be realistic while requiring substantially less transfer of data than with today’s video streams, we reap the following benefits:

  • Reduced bandwidth & latency. Implied is that network latency is significantly greater than compression & rendering latency.
  • Higher image quality.
  • Eye to eye contact.
  • Better video performance in terms of brightness, contrast and white balance.
  • Better spatial arrangement. Note also that people who are not collocated can be shown in the same environment, and that the background may be removed altogether, depending on the sophistication of the pattern recognition technology.
  • And, possibly to be further explained in an update: Improved multicast, mobile videoconferencing, and collaboration.

About Bjørn Solnørdal Tennøe

Interaction designer, happy camper & biker, proud father to three krazy kids.