While VLMs are strong at understanding both text and images, they often rely solely on text when reasoning, which limits their ability to solve tasks that require visual thinking, such as spatial puzzles. People naturally visualize solutions rather than describe every detail, but VLMs struggle to do the same. Although some recent models can generate both text and images, training them for image generation often weakens their reasoning ability, and producing an image on its own does not support step-by-step visual reasoning. As a result, unlocking the full potential of VLMs for complex, visually grounded thinking remains a key challenge in the field. CoT prompting…