This project explores the use of cutting-edge AI vision-language models (VLMs) to generate "alt text" image descriptions for digital collections at scale, as well as further potential applications of such technologies, including free-text “evocative” search in multiple languages, object detection, and other methods for improving discovery within image collections.
Peter Broadwell, Manager of AI Modeling & Inference, Research Data Services, Stanford University Libraries
Lindsay King, Head Librarian, Art and Architecture Library, Stanford University Libraries
Some vision-language models can generate image captions based not only on the visual contents of the images but also on "conditioning" supplied via accompanying free-text descriptions of the images and specific instructions regarding desired aspects of the captions. These instructions are also known as "prompts" - a concept that should be familiar to anyone who has conversed with a large language model chatbot. This raises two promising possibilities:
- Minimal to no fine-tuning needed: AI models are "taught" how to complete tasks like image captioning by showing them many examples of the task being completed successfully. But models that have already undergone a great deal of such foundational instruction (often referred to as "pre-training") may still need further "fine-tuning" on specific examples to gain facility with the task in a certain domain; e.g., a model that can generate good long-form captions may need further fine-tuning to generate concise alt text. Fine-tuning, however, can be a time-consuming and compute-intensive task, and sufficient training examples may not be readily available in some specialized domains. A vision-language model that responds properly to instructions provided via the text prompt may not need such expensive fine-tuning.
- Outputs can be conditioned on context at the per-item level: Because some vision-language AI models can "condition" their image descriptions on a different text prompt for each image, it is possible to incorporate any available metadata, however partial, when generating alt text for an image, and perhaps even text that appears near the image's intended position on the page (see the sketch after this list).
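As a concrete illustration of the second possibility, the sketch below shows how a per-item prompt might be assembled from whatever context happens to exist for an image. The function name, field names, and prompt wording are hypothetical, offered only to make the idea of per-item conditioning tangible; they are not the project's actual prompt templates.

```python
def build_alt_text_prompt(metadata=None, nearby_text=None):
    """Assemble a per-item prompt that 'conditions' the model on whatever
    context is available for this image. Field names are illustrative."""
    prompt = ("Write one concise sentence of alt text for this image, "
              "suitable for a screen reader.")
    if metadata:
        # Include only the fields that are actually present; partial metadata still helps.
        context = "; ".join(f"{key}: {value}" for key, value in metadata.items() if value)
        if context:
            prompt += f"\nCatalog metadata for this image: {context}"
    if nearby_text:
        prompt += f"\nText appearing near the image on the page: {nearby_text}"
    return prompt

# An item with partial metadata (no creator recorded) still yields a usable prompt.
print(build_alt_text_prompt({"title": "View of Hoover Tower", "date": "c. 1941", "creator": None}))
```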
Potential alt-text descriptions were generated for significant portions of several digital image collections covering a range of subjects and formats. The following pages display the results of using Qwen2.5-VL, a powerful vision-language model, to produce both 1) "unprompted" alt-text image descriptions solely on the basis of a generic task description (the "system prompt") = 🤖, and 2) descriptions that are additionally "conditioned" on available human-provided (👤) metadata or image descriptions = 🤖🤝👤.
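For reference, the following is a minimal sketch of what these two modes can look like when running Qwen2.5-VL through the Hugging Face transformers and qwen-vl-utils packages. The checkpoint size, image paths, and prompt wording are assumptions for illustration, not the exact configuration used to produce the results on the following pages.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed checkpoint size
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Generic task description (the "system prompt"); wording is illustrative.
SYSTEM_PROMPT = ("You write concise, objective alt text for images in library "
                 "digital collections, in one sentence per image.")

def describe(image_path, conditioning_text=None):
    """Generate alt text for one image: 'unprompted' (system prompt only) when
    conditioning_text is None, or additionally 'conditioned' on human-provided metadata."""
    user_text = "Describe this image as alt text."
    if conditioning_text:
        user_text += f"\nContext from existing metadata: {conditioning_text}"
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {"role": "user", "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": user_text},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# 🤖 unprompted vs. 🤖🤝👤 conditioned on a (hypothetical) catalog record
print(describe("scans/item_0001.jpg"))
print(describe("scans/item_0001.jpg", "Title: View of Hoover Tower; Date: c. 1941"))
```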