
Researchers from Datategy and Math & AI Institute Offer a Perspective for the Future of Multi-Modality of Large Language Models

by Asif Razzaq

Researchers from Datategy SAS in France and the Math & AI Institute in Turkey propose a potential direction for the recently emerging multi-modal architectures. The central idea of their study is that the well-studied Named Entity Recognition (NER) formulation can be incorporated into a many-modal Large Language Model (LLM) setting.

Multimodal architectures such as LLaVA, Kosmos, or AnyMAL have been gaining traction recently and have demonstrated their capabilities in practice. These models tokenize data from modalities other than text, such as images, and use external modality-specific encoders to embed them into a joint linguistic space. This gives the architectures a means to instruction-tune on multi-modal data interleaved with text.
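To make the pattern concrete, below is a minimal PyTorch-style sketch of how such architectures typically bridge modalities: a modality-specific encoder produces an embedding, and a learned projection maps it into a few "soft tokens" in the LLM's embedding space so they can be interleaved with ordinary text tokens. All module names and dimensions here are illustrative assumptions, not taken from the paper or from any specific model above.

```python
# Minimal sketch: project a non-text modality embedding into the LLM token space.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, encoder_dim: int, llm_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        # Map one encoder embedding to a short sequence of "soft tokens"
        # living in the LLM's embedding space.
        self.proj = nn.Linear(encoder_dim, llm_dim * num_tokens)

    def forward(self, modality_embedding: torch.Tensor) -> torch.Tensor:
        # modality_embedding: (batch, encoder_dim)
        out = self.proj(modality_embedding)
        return out.view(-1, self.num_tokens, out.shape[-1] // self.num_tokens)

# Usage: embed an image (or any other modality) with its own encoder,
# project it, then concatenate the result with the text-token embeddings
# before feeding the joint sequence to the language model.
image_features = torch.randn(2, 1024)            # stand-in for a vision-encoder output
projector = ModalityProjector(encoder_dim=1024, llm_dim=4096)
soft_tokens = projector(image_features)          # shape: (2, 4, 4096)
```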

The authors propose that this generic architectural preference can be extended into a much more ambitious setting in the near future, which they refer to as an "omni-modal era". Notions of "entities", closely related to the concept of NER, can then be treated as modalities in these types of architectures.

For instance, current LLMs are known to struggle with full algebraic reasoning. Although research is ongoing into "math-friendly" specialized models and the use of external tools, one possible direction for this problem is to define quantitative values as a modality within this framework. Another example would be implicit and explicit date and time entities, which could be processed by a dedicated temporally-cognitive modality encoder.
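As a hedged illustration of what a "quantitative value" modality could look like under such a framework, the sketch below turns a scalar into simple numeric features (sign, log-magnitude, mantissa) and projects them into the joint linguistic space. The feature choice, class name, and dimensions are assumptions for illustration, not the authors' design.

```python
# Illustrative sketch of a numeric-entity modality encoder.
import math
import torch
import torch.nn as nn

class NumericEntityEncoder(nn.Module):
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256),
            nn.GELU(),
            nn.Linear(256, llm_dim),
        )

    def forward(self, value: float) -> torch.Tensor:
        # Decompose the scalar so that magnitude is represented smoothly
        # rather than as a string of digit tokens.
        sign = 1.0 if value >= 0 else -1.0
        magnitude = abs(value)
        log_mag = math.log10(magnitude) if magnitude > 0 else 0.0
        mantissa = magnitude / (10 ** math.floor(log_mag)) if magnitude > 0 else 0.0
        features = torch.tensor([[sign, log_mag, mantissa]], dtype=torch.float32)
        return self.mlp(features)   # (1, llm_dim): one "numeric token"

# e.g. embed the literal 42195 (metres in a marathon) as a single numeric token
numeric_token = NumericEntityEncoder()(42195.0)
```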

LLMs also have a very difficult time with geospatial understanding and are far from being considered "geospatially aware". Numerical global coordinates need to be processed in a way that accurately reflects notions of proximity and adjacency in the linguistic embedding space. Incorporating locations as a dedicated geospatial modality, with a specifically designed encoder and joint training, could therefore address this problem. Beyond these examples, the first entities that come to mind as candidate modalities are people, institutions, and so on.
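In the same spirit, a geospatial modality encoder could map latitude/longitude onto the unit sphere, so that geodesically close places receive close input features, before projecting them into the LLM embedding space. The sketch below is an illustrative assumption of such an encoder, not a design from the paper.

```python
# Illustrative sketch of a geospatial modality encoder.
import math
import torch
import torch.nn as nn

class GeospatialEncoder(nn.Module):
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256),
            nn.GELU(),
            nn.Linear(256, llm_dim),
        )

    def forward(self, lat_deg: float, lon_deg: float) -> torch.Tensor:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        # Cartesian coordinates on the unit sphere: nearby places are
        # nearby in this input space, and joint training can carry that
        # structure over into the linguistic embedding space.
        xyz = torch.tensor([[math.cos(lat) * math.cos(lon),
                             math.cos(lat) * math.sin(lon),
                             math.sin(lat)]], dtype=torch.float32)
        return self.mlp(xyz)   # (1, llm_dim): one geospatial "token"

# e.g. Paris (48.86 N, 2.35 E) and nearby Versailles should land close together
paris_token = GeospatialEncoder()(48.86, 2.35)
```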

The authors argue that this type of approach promises to address parametric/non-parametric knowledge scaling and context-length limitations, since complexity and information can be distributed across numerous modality encoders. It might also ease the injection of updated information via modalities. The researchers only outline the boundaries of such a potential framework and discuss the promises and challenges of developing an entity-driven language model.

Check out the Paper. All credit for this research goes to the researchers of this project.
