Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models

arXiv Hugging Face

Release

  • [2025/02/04] We uploaded our Synthdog data to HuggingFace: Dataset
  • [2025/01/10] We released Centurio with model checkpoints and code for training & testing. Data will follow soon.

Installation

Standalone (with HuggingFace transformers library)

The model can be used directly through the transformers library with our custom code. Check out the model cards of our checkpoints in the Centurio Collection on HuggingFace for more details.

Example Code
from transformers import AutoModelForCausalLM, AutoProcessor
import timm  # required by the model's custom code (timm-based vision encoder)
from PIL import Image    
import requests

url = "https://upload.wikimedia.org/wikipedia/commons/b/bd/Golden_Retriever_Dukedestiny01_drvd.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model_name = "WueNLP/centurio_qwen"

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Images in the prompt are indicated with '<image_placeholder>'.
prompt = "<image_placeholder>\nBriefly describe the image in German."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # This is the system prompt used during our training.
    {"role": "user", "content": prompt}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

model_inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
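
The snippet above loads the model on CPU. For GPU inference you will likely want half precision; here is a minimal sketch (assuming a CUDA device is available; torch_dtype and .to() are standard transformers/PyTorch APIs, not Centurio-specific):

import torch

# Load in bfloat16 and move to the GPU; the rest of the example is unchanged.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # halves memory compared to float32
).to("cuda")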

With Trident (for training or evaluation)

We use trident, a modular framework by Fabian Schmidt that combines PyTorch Lightning with Hydra configs.

pip install -r requirements.txt 
pip install git+https://github.com/fdschmidt93/trident.git

Primer on trident: trident in 20 minutes

tl;dr: We compose a hierarchy of configs (/configs) into an experiment config (experiment) that defines
(1) which datasets to use for training and testing, and, for the latter, which metrics to compute (dataspec and dataspecs),
(2) which model to use (module), and
(3) other settings such as the optimizer, logging, and checkpointing
for a PyTorch Lightning run using our code in /src.
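
For orientation, an experiment config is an ordinary Hydra YAML file that composes entries from those groups. The sketch below is purely illustrative, with hypothetical group and option names; consult the real files under /configs (e.g., mblipv2_train.yaml) for the actual structure:

# configs/experiment/my_experiment.yaml (hypothetical, for illustration only)
# @package _global_
defaults:
  - /dataspec: my_train_data    # (1) training dataset
  - /dataspecs: my_eval_tasks   # (1) test datasets and their metrics
  - /module: my_model           # (2) model

module:
  optimizer:
    lr: 1e-4                    # (3) optimizer, logging, checkpointing, ...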

Training

Below is an example showing how to use the experiment configs (mblipv2_train.yaml and mblipv2_pretrain.yaml). Trident allows us to overwrite (nearly) all parameters specified in the configs, which we use to set things like the LLM, the learning rate, etc.

For examples of how to structure the data JSON files, see /data.

CLI Command
# run.train_data is the prefix path for all data jsons; hydra.run.dir is the output folder.
# trainer.devices: single- and multi-GPU runs both work out of the box.
# ++trainer.strategy=ddp_find_unused_parameters_true was needed for Phi 3.5 to work;
# other LLMs can remove that line and use the default DeepSpeed Stage 2 config.
python -u -m trident.run experiment=mblipv2_train \
  run.train_data=/p/data/jsons \
  run.image_root=/p/data/images \
  run.train_file=multilingual/combination/mblipv2_instruct_base_en.json \
  hydra.run.dir=$output \
  ++run.llm=microsoft/Phi-3.5-mini-instruct \
  ++run.vit_model=vit_so400m_patch14_siglip_384 \
  ++run.train_padding_side="left" \
  module.model.adapter_type="mlp" \
  module.model.load_4bit=False \
  module.model.use_flash_attn=True \
  module.model.use_lora=True \
  module.optimizer.lr=0.0001 \
  module.optimizer.weight_decay=0.0 \
  module.model.lora_r=256 module.model.lora_alpha=512 \
  ++run.max_seq_len=1024 \
  run.test_batch_size=2 run.test_num_workers=2 \
  run.train_batch_size=2 run.train_num_workers=6 \
  trainer.devices=$NUM_GPUS \
  trainer.accumulate_grad_batches=$ACCUM \
  ++run.seed=4242 \
  trainer.val_check_interval=0.25 \
  ++trainer.strategy="ddp_find_unused_parameters_true" \
  '++logger.wandb.tags=[training,english_only]'

To use the image tiling approach used for Centurio, replace

module.model.adapter_type="mlp" \

with

  ++run.multi_scale=2 \
  module.model.adapter_type="multiscale-pool" \
  ++module.model.adapter_config.multi_scale=2 \

Evaluation

Below is an example of evaluating a model trained with the above training script on a downstream task (MAXM in this case). It loads the DeepSpeed checkpoint (which contains the MLP weights) together with the PEFT adapter checkpoint:

CLI Command
# run.train_data is the prefix path for all data jsons; hydra.run.dir is the output folder.
# trainer.devices must be 1: multi-GPU evaluation is not supported.
python -u -m trident.run experiment=mblipv2_test_maxm \
  run.train_data=/p/data/jsons \
  run.xm3600_image_root=/p/data/images/maxm \
  hydra.run.dir=$output \
  ++module.model.train_checkpoint=/checkpoints/12_08_2024_09_58_16/checkpoints/0-24250.ckpt/checkpoint/mp_rank_00_model_states.pt \
  ++module.model.lora_checkpoint=/checkpoints/12_08_2024_09_58_16/checkpoints/0-24250 \
  ++run.llm=meta-llama/Meta-Llama-3-8B-Instruct \
  ++run.vit_model=vit_so400m_patch14_siglip_384 \
  ++run.train_padding_side="left" \
  module.model.adapter_type="mlp" \
  module.model.load_4bit=False \
  module.model.use_flash_attn=True \
  module.model.use_lora=True \
  run.test_batch_size=2 run.test_num_workers=16 \
  trainer.devices=1 \
  '++logger.wandb.tags=[eval,maxm]'

Citation

@article{centurio2025,
  author       = {Gregor Geigle and
                  Florian Schneider and
                  Carolin Holtermann and
                  Chris Biemann and
                  Radu Timofte and
                  Anne Lauscher and
                  Goran Glava\v{s}},
  title        = {Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models},
  journal      = {arXiv},
  volume       = {abs/2501.05122},
  year         = {2025},
  url          = {https://arxiv.org/abs/2501.05122},
  eprinttype   = {arXiv},
  eprint       = {2501.05122},
}
