# Microsoft's New 'PROJECT RUMI' Takes Everyone By SURPRISE! (Now Announced!)

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=W-lbmiX2JK0
- **Date:** 10.08.2023
- **Duration:** 9:01
- **Views:** 21,735
- **Source:** https://ekstraktznaniy.ru/video/14745

## Description

Introduction 00:00
How It Works 01:34
Diagram 03:21
Non Contact Sensors 05:21
Visual Demo 06:29
Transformers 07:02

https://www.microsoft.com/en-us/research/project/project-rumi/

Welcome to our channel where we bring you the latest breakthroughs in AI. From deep learning to robotics, we cover it all. Our videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on our latest videos.

Was there anything we missed?

(For Business Enquiries)  contact@theaigrid.com

#LLM #Largelanguagemodel #chatgpt
#AI
#ArtificialIntelligence
#MachineLearning
#DeepLearning
#NeuralNetworks
#Robotics
#DataScience
#IntelligentSystems
#Automation
#TechInnovation

## Transcript

### Introduction [0:00]

Microsoft has recently unveiled an innovative breakthrough in the realm of artificial intelligence, setting the stage for a transformation in how we engage with large language models like ChatGPT. This pioneering endeavor is termed Project Rumi, and it is not just another incremental step in AI advancement; rather, it's a leap forward, differentiating itself from prior efforts by leading AI research teams.

Until now, interacting with AI language models has largely been a text-based experience: we input questions or prompts, and the AI responds based on its training. But what if an AI could go beyond just understanding the words you type? What if it could sense the underlying emotions behind those words? Enter Project Rumi. The core idea behind this initiative is multimodal paralinguistic prompting. At a glance, this might sound like complex tech jargon, but it can be distilled into a simpler concept: enabling large language models not only to process textual information but also to gauge and respond to the user's emotions. This means that when you communicate with such a model, it would understand not only the words but also the sentiment and feelings behind them.

If you go over to Microsoft's research page, they do a lot of explaining to show how it works, and they also include a video in which they demonstrate the new software. They state that large language models are great, but LLMs also have limitations: they may not always understand the context and nuances of a conversation.

### How It Works [1:34]

Their performance also depends on the quality and specificity of the user's input or prompt. The data that the user inputs into the LLM is a lexical entry, which does not comprehensively represent the nuances of human-to-human interaction. It is, in fact, missing all the paralinguistic information: intonation, gestures, facial expressions, and everything besides the actual words that contributes to the meaning and intentions of the speaker. This can lead to misinterpretation, misunderstanding, or inappropriate responses from the LLM.

Project Rumi incorporates paralinguistic input into prompt-based interactions with LLMs, with the objective of improving the quality of communication. Providing this context is critical to enhancing LLM capabilities in this AI-as-a-copilot era. The current system leverages separately trained vision- and audio-based models to detect and analyze non-verbal cues extracted from data streams. The models assess sentiment from cognitive and physiological data in real time, generating appropriate paralinguistic tokens that augment the standard lexical prompt input to existing LLMs such as GPT-4. This multimodal, multi-step architecture integrates seamlessly with any pre-trained text-based LLM, providing additional information on the user's sentiment and intention that is not captured by text alone and augmenting the prompt with the richness and subtlety of human communication to bring human-AI interaction to a new level. The illustrated diagram that accompanies the description offers a comprehensive view of how a user interacts with the system.
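To make that multi-step flow concrete, here is a minimal Python sketch: separately analyzed audio and vision streams each yield a sentiment estimate, the estimates are fused, and the result is serialized as extra "paralinguistic" tokens prepended to the user's lexical prompt before it reaches a text-only LLM. All names and the token format here are hypothetical illustrations; Microsoft has not published Project Rumi's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Sentiment:
    label: str      # e.g. "happy", "frustrated", "neutral"
    arousal: float  # 0.0 (calm) .. 1.0 (highly aroused)

def analyze_audio(audio_frames) -> Sentiment:
    """Stub for the separately trained audio model (tone, pitch, speed)."""
    return Sentiment(label="frustrated", arousal=0.7)  # placeholder output

def analyze_vision(video_frames) -> Sentiment:
    """Stub for the separately trained vision model (facial expressions)."""
    return Sentiment(label="frustrated", arousal=0.6)  # placeholder output

def fuse(audio: Sentiment, vision: Sentiment) -> Sentiment:
    """Naive fusion: keep the label if both models agree, average arousal."""
    label = audio.label if audio.label == vision.label else "mixed"
    return Sentiment(label, (audio.arousal + vision.arousal) / 2)

def augment_prompt(user_text: str, s: Sentiment) -> str:
    """Prepend paralinguistic context so a text-only LLM can react to it."""
    return (f"[paralinguistic: user seems {s.label}, "
            f"arousal={s.arousal:.1f}]\n{user_text}")

# The augmented string is what would be sent to GPT-4 in place of the raw prompt.
print(augment_prompt("Why is my code still failing?",
                     fuse(analyze_audio(None), analyze_vision(None))))
```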

### Diagram [3:21]

This visual representation breaks down the various inputs into two primary categories: physical sensors and non-contact systems.

Physical sensors are tactile devices that come into direct contact with the user to gather real-time physiological data. The diagram showcases three primary types. The first is EEG (electroencephalogram): this sensor measures and records the electrical activity of the brain, providing valuable insights into a user's cognitive processes, alertness, and mental state, and allowing the system to gauge the user's concentration, relaxation, or potential cognitive workload. The second is perspiration sensors, also known as galvanic skin response (GSR) sensors: these measure the electrical conductance of the skin, which varies with its moisture level. As sweat production can be an indicator of emotional or physiological arousal, these sensors offer cues about the user's emotional state, be it stress, excitement, or fear. The third is a heart rate monitor: as the name suggests, this device measures the user's heart rate, and fluctuations in heart rate can indicate a range of emotions or reactions, from relaxation and calmness to anxiety or excitement.

Non-contact systems gather data without physically touching the user, relying on visual and auditory cues instead. The diagram details three primary systems. The first is a camera: this system captures visual information, analyzing facial expressions and micro-expressions. These expressions can reveal a plethora of emotions, from happiness and surprise to sadness and anger, enabling the system to deduce the user's emotional state accurately.
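As an illustration of how the three physical sensors might be combined, here is a small sketch that maps raw EEG, GSR, and heart-rate readings onto one coarse arousal score. The signal ranges, thresholds, and equal weighting are invented for illustration and are not taken from Project Rumi.

```python
from dataclasses import dataclass

@dataclass
class PhysiologicalSample:
    eeg_beta_ratio: float    # beta/alpha band-power ratio (engagement proxy)
    gsr_microsiemens: float  # skin conductance (emotional arousal proxy)
    heart_rate_bpm: float    # beats per minute

def arousal_score(s: PhysiologicalSample) -> float:
    """Map each reading onto 0..1 and average them (naive equal-weight fusion)."""
    eeg = min(s.eeg_beta_ratio / 2.0, 1.0)                   # ratio >= 2.0 -> fully engaged
    gsr = min(max(s.gsr_microsiemens - 1.0, 0.0) / 10, 1.0)  # assume a 1..11 uS range
    hr = min(max(s.heart_rate_bpm - 60.0, 0.0) / 60, 1.0)    # assume a 60..120 bpm range
    return (eeg + gsr + hr) / 3

sample = PhysiologicalSample(eeg_beta_ratio=1.4,
                             gsr_microsiemens=6.0,
                             heart_rate_bpm=95.0)
print(f"arousal = {arousal_score(sample):.2f}")  # roughly 0.59 for this sample
```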

### Non Contact Sensors [5:21]

The second is eye tracking: this advanced system follows the user's eye movements and gaze direction. By doing so, it can infer elements like focus, interest, and even the emotional response to what the user is viewing. For instance, rapid eye movement might suggest nervousness or excitement, while a fixed gaze might indicate concentration or deep thought. The third is speech analysis: by analyzing the user's voice, this system can detect variations in tone, pitch, and speed, which can be indicative of emotions like happiness, frustration, uncertainty, or confidence.

Together, this combination of physical sensors and non-contact systems paints a holistic picture of the user's emotional and cognitive state, allowing the system to tailor its responses and interactions accordingly. This fusion of technology offers a promising leap towards a more intuitive and empathetic human-AI interaction.
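As a rough sketch of the speech-analysis side, the snippet below extracts the kinds of prosodic features mentioned above (pitch, energy, and a speaking-rate proxy) using the open-source librosa library as a stand-in; the video does not reveal Project Rumi's actual audio pipeline, so this is only an illustration of the concept.

```python
import librosa
import numpy as np

def prosody_features(path: str) -> dict:
    """Extract coarse tone/pitch/speed cues from a speech recording."""
    y, sr = librosa.load(path, sr=16000)

    # Fundamental frequency (pitch) via the YIN estimator,
    # limited to a typical speech range.
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)

    # Short-time energy, a proxy for loudness/intensity.
    rms = librosa.feature.rms(y=y)[0]

    # Crude speaking-rate proxy: onset events per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    duration = len(y) / sr

    return {
        "pitch_mean_hz": float(np.mean(f0)),
        "pitch_variance": float(np.var(f0)),  # high variance can signal excitement
        "energy_mean": float(np.mean(rms)),
        "onsets_per_second": len(onsets) / duration,
    }

# features = prosody_features("recording.wav")  # hypothetical input file
```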

### Visual Demo [6:29]

Microsoft also includes a video in which they demonstrate how effective this software is, having integrated it into what seems to be a beta version of Bing Chat. In the first screenshot, you can see where the user records a video or audio clip, which is then transcribed and input into the AI as text. The data on what emotion the user is feeling is then displayed as a pie chart: you can see that some of this user's emotion was happy, there is also apparently some disgust in the audio which the AI has picked up on, and in the video the AI describes the overall state of the user as neutral. Further on in the video, they describe how they break down the audio to work out what is being said. The audio is broken down into two parts: first the text, and then the features.
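The pie chart in the demo is simply a distribution over emotion labels. Here is a minimal sketch of how per-segment classifier scores (the numbers below are made up) could be aggregated into that chart's data:

```python
# Hypothetical per-segment emotion scores from an audio/video classifier.
per_segment = [
    {"happy": 0.6, "disgust": 0.1, "neutral": 0.3},
    {"happy": 0.2, "disgust": 0.3, "neutral": 0.5},
    {"happy": 0.4, "disgust": 0.1, "neutral": 0.5},
]

# Average the scores across segments to get the pie-chart shares.
totals = {label: 0.0 for label in per_segment[0]}
for scores in per_segment:
    for label, p in scores.items():
        totals[label] += p

shares = {label: total / len(per_segment) for label, total in totals.items()}
print(shares)  # approximately {'happy': 0.40, 'disgust': 0.17, 'neutral': 0.43}
```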

### Transformers [7:02]

They use two transformers to do this, the first being HuBERT. The HuBERT transformer is a self-supervised speech representation learning model inspired by the BERT architecture, which was originally designed for natural language processing tasks. In simple terms, HuBERT is used to transform raw speech data into a more language-like structure by discovering discrete hidden units, which can be compared to words or tokens in a text sentence. The main goal of HuBERT is to improve performance on various downstream tasks such as speech recognition, generation, and compression. It achieves this by consuming masked continuous speech features and predicting predetermined cluster assignments. The model architecture of HuBERT is similar to wav2vec 2.0, with the base and large versions having 95 million and 317 million parameters respectively.

The second transformer is DistilBERT, a smaller, faster, and lighter version of the BERT transformer model designed for natural language processing tasks. It is created using a technique called knowledge distillation, which reduces the size of the original BERT model by 40 percent while retaining 97 percent of its performance. This makes DistilBERT more efficient and suitable for on-device applications and situations with limited computational resources. DistilBERT can be used for various natural language processing tasks such as text classification.
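For readers who want to experiment with the same two model families, here is a short sketch using publicly available Hugging Face checkpoints for HuBERT and DistilBERT. Whether Project Rumi uses these exact checkpoints is an assumption; the video names only the model families.

```python
import torch
from transformers import AutoFeatureExtractor, HubertModel, pipeline

# 1) HuBERT: raw waveform -> continuous speech representations.
#    "facebook/hubert-base-ls960" is the ~95M-parameter base model.
extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

waveform = torch.randn(16000)  # stand-in for one second of 16 kHz speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech_features = hubert(**inputs).last_hidden_state  # (batch, frames, 768)

# 2) DistilBERT: sentiment over the transcribed text. This checkpoint is a
#    DistilBERT fine-tuned for sentiment classification on SST-2.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I can't believe this finally works!"))
# e.g. [{'label': 'POSITIVE', 'score': ...}]
```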
Overall, Project Rumi is quite surprising. We are so used to just chatting with large language models, but now they may be able to gauge how we feel and think, which means there will be a much deeper level of understanding, which should improve the responses overall.