Beyond Text: Toward Extended Large Language Models

While current large language models focus primarily on text, the underlying predictive principles are far more general: they apply to any data that can be represented as a sequence. This generality opens a natural path toward extended language models capable of modeling multiple modalities.

Modern LLMs are, at their core, sequential prediction systems: given a sequence of tokens, they model the conditional distribution of the next token. This predictive paradigm has deep parallels with ideas long developed in image and video coding, where predictive models exploit spatial and temporal redundancies to efficiently represent visual data. The Multicom Lab has decades of foundational work in this area, and these prior results directly inform our approach to extending predictive models beyond text. 
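To make the sequential-prediction view concrete, the sketch below uses a toy bigram count model in place of a learned neural network (the corpus and tokens are invented for illustration). It estimates the conditional distribution of the next token and factors a sequence's probability by the chain rule, which is exactly the quantity a modern LLM models at far greater scale.

```python
from collections import Counter, defaultdict

# Invented toy corpus; tokens are whitespace-split words.
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams to estimate P(next | current) -- a stand-in for an
# LLM's learned conditional distribution over the next token.
bigram_counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    bigram_counts[cur][nxt] += 1

def next_token_dist(token):
    """Conditional distribution over the next token given the current one."""
    counts = bigram_counts[token]
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def sequence_prob(tokens):
    """Chain-rule factorization: P(t1..tn) = prod_i P(t_i | t_{i-1})."""
    p = 1.0
    for cur, nxt in zip(tokens, tokens[1:]):
        p *= next_token_dist(cur).get(nxt, 0.0)
    return p

print(next_token_dist("the"))              # P('cat'|'the') = 2/3, P('mat'|'the') = 1/3
print(sequence_prob(["the", "cat", "sat"]))  # (2/3) * (1/2) = 1/3
```

A transformer-based LLM replaces the bigram table with a neural network conditioned on the entire preceding context, but the object being modeled, the next-token conditional distribution, is the same.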

This parallel between language modeling and predictive coding motivates extending large language models to additional data modalities.

A key idea is that images and videos can also be represented as sequences of tokens, provided that suitable tokenization schemes are designed. Inspired by techniques in image and video coding, visual data can be tokenized using blocks or structured regions, enabling prediction models to capture spatial and temporal dependencies. 

With appropriate tokenization strategies: 

  • images become sequences of visual tokens, 

  • videos become sequences of spatiotemporal tokens, and 

  • predictive models can learn contextual relationships among visual elements. 
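A minimal sketch of block-based image tokenization follows. The 2x2 patch size and raster-scan ordering are illustrative choices, not a specific published scheme; the point is only that an image becomes an ordered sequence of visual tokens that a predictive model can consume.

```python
import numpy as np

def tokenize_image(image, block=2):
    """Split an HxW image into non-overlapping block x block patches and
    flatten them in raster-scan order, yielding a token sequence."""
    h, w = image.shape
    assert h % block == 0 and w % block == 0, "image must tile evenly"
    tokens = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            tokens.append(image[i:i + block, j:j + block].flatten())
    return np.stack(tokens)  # shape: (num_tokens, block * block)

# A 4x4 toy "image" becomes a sequence of four 2x2 patch tokens.
img = np.arange(16).reshape(4, 4)
seq = tokenize_image(img)
print(seq.shape)  # (4, 4): four tokens, each holding four pixel values
```

Video extends this naturally: stacking frames and cutting blocks along the temporal axis as well yields spatiotemporal tokens, and the raster-scan sequence order is what lets a next-token predictor exploit spatial and temporal redundancy.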

This perspective opens the possibility of unified predictive models that handle text, images, and video within a common information-theoretic framework. It also raises new scientific questions about the nature of visual semantics: do learned embeddings of visual structures carry contextual meaning analogous to word embeddings in language, and how can such meaning be characterized and measured?