MiniGPT-4 is a model designed to improve vision-language understanding. It builds on the observation that large language models such as GPT-4 show excellent multi-modal generation capabilities. The architecture has three parts: a frozen visual encoder (a pretrained ViT paired with a Q-Former), the frozen Vicuna large language model, and a single linear projection layer that aligns the two. Only the projection layer is trained, so that the visual features match what Vicuna expects as input. The model is capable of many tasks, such as creating detailed image descriptions, generating websites from hand-written drafts, writing stories or poems based on images, providing solutions to problems shown in images, and even teaching users how to cook from food photos. Because the encoder and the language model stay frozen, training is computationally efficient: roughly 5 million aligned image-text pairs are enough to train the projection layer.
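
The role of the projection layer can be illustrated with a minimal sketch in plain Python. It is not the MiniGPT-4 implementation: the helper names (`make_linear`, `project`, `count_linear_params`) and the toy sizes are assumptions chosen for illustration, and the frozen encoder and language model are replaced by placeholder vectors. The sketch only shows the data flow: visual tokens are mapped by one linear layer into the language model's embedding size and placed in front of the text embeddings.

```python
import random


def make_linear(d_in, d_out, seed=0):
    """Create weight and bias for one linear layer (the only trainable part)."""
    rng = random.Random(seed)
    scale = 1.0 / d_in ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)]
              for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(visual_tokens, weight, bias):
    """Map each visual token (length d_in) to the LLM embedding size (d_out)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


def count_linear_params(d_in, d_out):
    """Number of trainable parameters in the projection layer."""
    return d_in * d_out + d_out


# Toy sizes (hypothetical, far smaller than any real model).
NUM_VISUAL_TOKENS, D_VISION, D_LLM = 4, 8, 16

# Stand-in for the output of the frozen visual encoder.
rng = random.Random(1)
visual_tokens = [[rng.gauss(0, 1) for _ in range(D_VISION)]
                 for _ in range(NUM_VISUAL_TOKENS)]

# The projection layer aligns visual features with the LLM embedding space.
weight, bias = make_linear(D_VISION, D_LLM)
visual_prompt = project(visual_tokens, weight, bias)

# Stand-in for three embedded text tokens from the frozen language model.
text_embeddings = [[0.0] * D_LLM for _ in range(3)]

# The language model would receive the projected image tokens, then the text.
llm_input = visual_prompt + text_embeddings
print(len(llm_input), len(llm_input[0]))  # prints "7 16"
print(count_linear_params(D_VISION, D_LLM))  # prints "144"
```

In this toy setup the only values that would change during training are `weight` and `bias`, which is why the number of trainable parameters stays small compared with the frozen components.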

