To better understand how VLMs process visual information, we conduct a series of controlled experiments in the VTQA (vision-text input) setting. For each sample, we replace the original image input (which matches the textual description) with one of the following: (1) No Image: only the textual input is kept and no image is provided, (2) Noise Image: a Gaussian noise image that is irrelevant to the task, and (3) Random Image: a randomly selected image from the dataset that does not match the textual description.
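As a concrete illustration, the sketch below constructs the three ablation conditions for a single sample. It is a minimal example, not the paper's actual pipeline: the field names (`sample["image"]`, `sample["text"]`), the image resolution, and the noise parameters are assumptions made for illustration.

```python
# Minimal sketch of the three image-ablation conditions in the VTQA setting.
# Assumed data format: each sample is a dict with "image" (PIL.Image) and "text" (str).
import random
import numpy as np
from PIL import Image

def ablate_image(sample, dataset, condition, size=(224, 224), seed=0):
    """Return (image, text) for one of the three controlled conditions."""
    rng = np.random.default_rng(seed)
    text = sample["text"]  # the textual description is always kept

    if condition == "no_image":
        # (1) No Image: drop the visual input entirely.
        return None, text

    if condition == "noise_image":
        # (2) Noise Image: per-pixel Gaussian noise, clipped to valid pixel range.
        noise = np.clip(rng.normal(127.5, 50.0, size=(*size, 3)), 0, 255).astype(np.uint8)
        return Image.fromarray(noise), text

    if condition == "random_image":
        # (3) Random Image: an image from a different sample, so it does not
        # match the textual description of the current sample.
        other = random.choice([s for s in dataset if s is not sample])
        return other["image"], text

    raise ValueError(f"unknown condition: {condition}")
```

Each condition keeps the text fixed and perturbs only the visual channel, so any change in model behavior can be attributed to how the model uses (or ignores) the image.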