Audio-visual speech recognition based on DCM

This repository implements an audio-visual speech recognition (AVSR) task on the Fairseq==0.8.0 toolkit. The model is a dual cross-modality attention (DCM) AVSR model built on the VGG-Transformer with a hybrid CTC/attention architecture.

  1. The dependencies are listed in the conda_env.yml file (see the environment setup command below).
  2. The training and inference arguments are the same as in the speech_recognition example of the original Fairseq toolkit; example commands are sketched after this list.
  3. The model is composed of three blocks: 1) self-attention Transformer-based modality encoders, 2) a dual cross-modality attention layer, and 3) a Transformer-based attention decoder. A sketch of the dual cross-modality attention layer is given below.
  4. Mel-filterbank audio features and pre-trained CNN video features are fed into the model, which then generates character-based sentences (an illustrative feature-extraction snippet follows below).
  5. WER and CER are calculated with the Sclite package from the predicted and ground-truth sentences (see the scoring command below).
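
Environment setup (item 1). The environment can be created from the provided file with conda; the environment name itself is defined inside conda_env.yml.

```bash
# Create the environment from the file shipped with the repository.
conda env create -f conda_env.yml
```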
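
Training and inference (item 2). The commands below are a minimal sketch of the CLI pattern used by Fairseq's speech_recognition example; DATA_DIR, MODEL_DIR, RES_DIR, and DCM_ARCH are placeholders, and the criterion shown mirrors the Fairseq example rather than anything confirmed by this repository.

```bash
# Training: follows the fairseq speech_recognition example's CLI.
# DCM_ARCH stands for this repo's registered architecture name; the hybrid
# CTC/attention training may use its own criterion instead of the one shown.
python train.py $DATA_DIR \
  --task speech_recognition \
  --arch $DCM_ARCH \
  --criterion cross_entropy_acc \
  --max-epoch 80 \
  --save-dir $MODEL_DIR \
  --user-dir examples/speech_recognition/

# Inference: infer.py from the speech_recognition example writes hypotheses
# to RES_DIR for later scoring.
python examples/speech_recognition/infer.py $DATA_DIR \
  --task speech_recognition \
  --path $MODEL_DIR/checkpoint_best.pt \
  --gen-subset test \
  --beam 20 \
  --results-path $RES_DIR \
  --user-dir examples/speech_recognition/
```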
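
Dual cross-modality attention (item 3). The idea can be pictured as two multi-head attention blocks running in parallel: audio queries attend over the video encoder output while video queries attend over the audio encoder output. The PyTorch sketch below is a minimal illustration of that idea with assumed layer names and dimensions; it is not the code used in this repository.

```python
import torch
import torch.nn as nn


class DualCrossModalityAttention(nn.Module):
    """Minimal sketch of a dual cross-modality attention (DCM) layer.

    Audio queries attend over video memory and video queries attend over
    audio memory; dimensions and names are illustrative assumptions.
    """

    def __init__(self, embed_dim=512, num_heads=8, dropout=0.1):
        super().__init__()
        self.audio_attends_video = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)
        self.video_attends_audio = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)
        self.norm_audio = nn.LayerNorm(embed_dim)
        self.norm_video = nn.LayerNorm(embed_dim)

    def forward(self, audio, video):
        # audio: (T_a, B, C) output of the audio modality encoder
        # video: (T_v, B, C) output of the video modality encoder
        a2v, _ = self.audio_attends_video(query=audio, key=video, value=video)
        v2a, _ = self.video_attends_audio(query=video, key=audio, value=audio)
        # Residual connections keep each stream's own information.
        audio_out = self.norm_audio(audio + a2v)
        video_out = self.norm_video(video + v2a)
        return audio_out, video_out


if __name__ == "__main__":
    layer = DualCrossModalityAttention()
    audio = torch.randn(200, 4, 512)   # 200 audio frames, batch size 4
    video = torch.randn(50, 4, 512)    # 50 video frames, batch size 4
    a, v = layer(audio, video)
    print(a.shape, v.shape)  # torch.Size([200, 4, 512]) torch.Size([50, 4, 512])
```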
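
Input features (item 4). As an illustration of the audio stream only, mel-filterbank features can be extracted with torchaudio as below; the actual feature dimensions and preparation scripts of this repository may differ, and the pre-trained CNN video features are assumed to be precomputed separately.

```python
import torchaudio

# Illustration only: 80-dimensional log mel-filterbank features from a waveform.
# The repository's own preparation scripts and feature sizes may differ.
waveform, sample_rate = torchaudio.load("utterance.wav")
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate
)
print(fbank.shape)  # (num_frames, 80)
```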
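
Scoring (item 5). The command below is a generic Sclite (NIST SCTK) invocation for scoring hypothesis against reference transcriptions in trn format; the file names are placeholders, not paths from this repository's scripts.

```bash
# Generic sclite invocation: score hypotheses against references in trn format
# and print overall plus per-utterance error rates; CER can be obtained
# analogously by scoring character-separated transcripts.
sclite -r ref.trn trn -h hyp.trn trn -i rm -o all stdout
```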
