Lately I keep hearing the term "multimodal AI." I had heard the word "modal" somewhere before. "Modal" is the adjective form of "mode." If you're an engineer, you'll probably think of a modal window, the kind that is forcibly displayed on top of a parent window. It is in fact a window in a "waiting mode": until you close it, you cannot operate the parent window.
Multimodal AI is AI that integrates multiple (multi) kinds of data. Humans naturally take in information through multiple senses and make judgments based on all of it. In table tennis, for example, a player not only watches the ball the opponent hits but also listens to the sound it makes on the racket, predicts which course the ball will take, and swings. I once saw an experiment on TV, and I was surprised that even top players swung and missed when the sound was shifted.
Multimodal AI judges from multiple kinds of input, the way humans do. Until now, AI has consisted of separate, modality-specific techniques, such as CNNs (convolutional neural networks) for vision (images) and RNNs (recurrent neural networks) for hearing (speech recognition). The aim is to make these multimodal, layering multiple kinds of information so that AI can evolve to make more sophisticated judgments.
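
To make the idea of layering multiple kinds of information a little more concrete, here is a minimal sketch of one common way to combine modalities: concatenate a feature vector from each one and classify the result. Everything in it is an assumption for illustration. The feature values stand in for what an image model and an audio model might output, the weights are hand-picked rather than learned, and the function name `fuse_and_classify` and the "left"/"right" labels are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(image_features, audio_features, weights, biases):
    """Concatenate two modality feature vectors, then apply a linear layer."""
    fused = list(image_features) + list(audio_features)
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Made-up features: pretend these came from an image model and an audio model.
image_features = [0.9, 0.1]
audio_features = [0.2, 0.8]

# Hand-picked (not learned) weights: one row per class, one column per feature.
weights = [
    [1.0, 0.0, 0.0, 1.0],  # hypothetical class "left"
    [0.0, 1.0, 1.0, 0.0],  # hypothetical class "right"
]
biases = [0.0, 0.0]

labels = ["left", "right"]
probs = fuse_and_classify(image_features, audio_features, weights, biases)
prediction = labels[probs.index(max(probs))]
print(prediction, probs)
```

In a real system the weights would be learned from data, and the two feature vectors would come from trained networks. The only point here is that a single decision is made from both inputs together instead of from each one separately.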