LiNKD: Leveraging knowledge distillation for lip reading on limited near-infrared data

Samar Daou1, Achraf Ben-Hamadou1,2, Ahmed Rekik1,2, Abdelaziz Kallel1,2
1SMARTS: Laboratory of Signals, systeMs, aRtificial Intelligence and neTworkS, 2Digital Research Centre of Sfax, Tunisia
Interpolate start reference image. Interpolate start reference image.

Samples from the LR-CAR dataset. The top row shows RGB images and the bottom row displays near-infrared images.

Abstract

In scenarios involving driver-car interaction, effective communication is vital for maintaining safety and operational efficiency. Conventional communication methods, like voice commands and manual inputs, can be impractical and unsafe, especially in noisy or dynamic environments. Lipreading offers a compelling alternative by utilizing visual cues from lip movements and facial expressions. However, existing lipreading techniques typically depend on RGB cameras, which can struggle with the variable lighting conditions found in vehicle interiors, making near-infrared imaging a more suitable option. The scarcity of near-infrared data presents challenges, as merely fine-tuning RGB models is inadequate. To overcome this limitation, we propose a knowledge distillation approach that transfers features from pre-trained RGB models to near-infrared models, thereby enhancing performance with the limited available near-infrared data. Additionally, we introduce LR-CAR, the first dual-modality dataset for driver-car interaction which includes both RGB and near-infrared modalities. Our results indicate that this method significantly boosts lipreading performance, achieving an impressive 26.51% improvement over basic fine-tuning.

Method overview

Overview of the EWC method across two training phases: In Phase 1, the model is trained on RGB data. Key parameters are identified using the Fisher Information matrix. In Phase 2, the model is trained on near-infrared data. A regularization term is applied to protect critical parameters from Phase 1. This approach allows the model to retain RGB knowledge while adapting to near-infrared input.

LR-CAR Lipreading Dataset

Data processing was done via an automated pipeline:

LR-CAR dataset was recorded from two different cameras in the same time: RGB and Near-Infrared camera.

29 speakers were involved to obtain a global number of 1044 utterances.

Each speaker repeated 12 representative car commands three times.

ID Commands ID Commands
1 Time to arrival 7 I need a break
2 Cooler 8 Take me home
3 Warmer 9 Hello
4 Mute 10 Take me to work
5 Weather forecast 11 Accept call
6 I feel fine 12 Reject call