speech-to-text-semisupervised#

Pseudolabel Nusantara Audiobook using Whisper Large V3#

Nusantara Audiobook#

This directory gathers semisupervised transcriptions of Malay audiobooks.
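The transcription itself can be reproduced with the Hugging Face transformers pipeline; below is a minimal sketch, where the audio filename is only an illustration and the decoding parameters (chunking, language) are assumptions rather than the exact settings used here:

# pseudolabel a single audiobook file with Whisper Large V3 (sketch)
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0,  # first GPU
)

# Whisper decodes 30-second windows; chunk_length_s lets the pipeline
# split longer audiobook files internally.
out = asr(
    "salina-part-001.mp3",  # hypothetical filename
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"language": "ms"},  # force Malay decoding
)
print(out["text"])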

All the videos, songs, images, and graphics used in the video belong to their respective owners and I do not claim any rights over them.

Copyright Disclaimer under section 107 of the Copyright Act of 1976, allowance is made for "fair use" for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research. Fair use is a use permitted by copyright statute that might otherwise be infringing.

Download#

  1. https://f000.backblazeb2.com/file/malaya-speech-model/data/dari-pasentran-ke-istana.gz

  2. https://f000.backblazeb2.com/file/malaya-speech-model/data/turki.gz

  3. https://f000.backblazeb2.com/file/malaya-speech-model/data/salina.gz

  4. Text only, https://f000.backblazeb2.com/file/malaya-speech-model/data/text-audiobook.tar.gz

  5. Test set, https://f000.backblazeb2.com/file/malaya-speech-model/data/testset-audiobook.tar.gz

  6. Train set with augmentation, https://f000.backblazeb2.com/file/malaya-speech-model/data/trainset-audiobook.tar.gz

  7. https://f000.backblazeb2.com/file/malaya-speech-model/data/salina-supervised-sani.tar.gz

  8. https://f000.backblazeb2.com/file/malaya-speech-model/data/dari-pasentran-ke-istana-supervised-sani.tar.gz

Citation#

@misc{Malay-Dataset,
  author = {Husein, Zolkepli},
  title = {Malay-Dataset},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  note = {We gather Bahasa Malaysia corpus! Semisupervised Speech Recognition from Audiobook},
  howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-audiobook}}
}

distributed multi-GPU pseudolabeling using Whisper on Malaya-Speech STT#

This pseudolabeling pipeline uses fast hashing to load audio files and can resume decoding from the last completed step.
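A minimal sketch of the resume logic, assuming each transcription is written to a JSON named after a hash of the audio path so that already-decoded files can be skipped on restart; the real scripts may key and store outputs differently:

# skip audio files whose pseudolabel output already exists (sketch)
import hashlib
import json
import os

OUT_DIR = "pseudolabel"  # hypothetical output directory

def output_path(audio_file):
    # stable key derived from the audio file path
    key = hashlib.sha256(audio_file.encode()).hexdigest()
    return os.path.join(OUT_DIR, f"{key}.json")

def already_decoded(audio_file):
    return os.path.exists(output_path(audio_file))

def save_transcription(audio_file, text):
    os.makedirs(OUT_DIR, exist_ok=True)
    with open(output_path(audio_file), "w") as f:
        json.dump({"audio_filename": audio_file, "text": text}, f)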

how-to#

  1. Generate the chunks hash map using generate-global-indices.ipynb (see the sketch below).
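A hedged sketch of what the generated indices file could contain: one flat JSON list that gives every audio file a stable global position, so every process enumerates the dataset identically. The directory layout and field names below are assumptions, not the notebook's exact output:

# build a global index -> audio file mapping (sketch)
import json
from glob import glob

files = sorted(glob("audio/**/*.mp3", recursive=True))  # hypothetical layout
indices = [{"global_index": i, "audio_filename": f} for i, f in enumerate(files)]

with open("indices-crawl-malaya-speech.json", "w") as f:
    json.dump(indices, f)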

Use torchrun#

NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 ~/.local/bin/torchrun --nproc_per_node 2 \
-m run \
--indices_filename=indices-crawl-malaya-speech.json --batch_size=16

NCCL is not required.
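When launched with torchrun, each process can pick its own shard of the global indices from the RANK and WORLD_SIZE environment variables that torchrun sets; a minimal sketch of one possible sharding, not necessarily how the actual run.py splits the work:

# strided sharding of the global indices across GPU processes (sketch)
import json
import os

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

with open("indices-crawl-malaya-speech.json") as f:
    indices = json.load(f)

# rank 0 takes items 0, world_size, 2 * world_size, ...
shard = indices[rank::world_size]
print(f"rank {rank} decodes {len(shard)} items")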

distributed multi-GPU pseudolabeling using Whisper on Malaysian YouTube videos#

This pseudolabeling pipeline uses fast hashing to load audio files and can resume decoding from the last completed step.

how-to#

  1. Prepare the chunks hash map using prepare-indices-chunks.ipynb (see the chunking sketch after this list).

  2. Generate the chunks hash map using generate-global-indices.ipynb.
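A hedged sketch of the chunk preparation step, assuming long YouTube audio is cut into fixed 30-second windows so each chunk can be decoded independently; the field names, chunk length, and directory layout are assumptions rather than the notebook's exact format:

# record 30-second (start, end) chunks per audio file (sketch)
import json
from glob import glob

import soundfile as sf

CHUNK_SECONDS = 30
chunks = []
for audio_file in sorted(glob("youtube-audio/*.wav")):  # hypothetical layout
    duration = sf.info(audio_file).duration
    start = 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        chunks.append({"audio_filename": audio_file, "start": start, "end": end})
        start = end

with open("chunks.json", "w") as f:
    json.dump(chunks, f)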

Use accelerate#

  1. Configure accelerate,

accelerate config

  2. Run accelerate (a sharding sketch follows below),

~/my-env/bin/accelerate launch run.py --indices_filename=global-indices.json --batch_size=4
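When launched with accelerate, run.py could hand each process its own slice of the global indices through Accelerator.split_between_processes; a minimal sketch, not necessarily how the actual script shards its work:

# split the global indices across accelerate processes (sketch)
import json
from accelerate import Accelerator

accelerator = Accelerator()

with open("global-indices.json") as f:
    indices = json.load(f)

# each process receives its own contiguous slice of the list
with accelerator.split_between_processes(indices) as shard:
    print(f"process {accelerator.process_index} decodes {len(shard)} items")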

Use torchrun#

NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 ~/my-env/bin/torchrun --nproc_per_node 2 \
-m run \
--indices_filename=global-indices.json --batch_size=4

NCCL is not required.

Run on 4x A100#

We use a batch size of 52,

NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 torchrun --nproc_per_node 4 \
-m run \
--indices_filename=crawl-youtube-global-indices.json --batch_size=52

Predict language using SpeechBrain#

NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 torchrun --nproc_per_node 4 \
-m run-predict-lang \
--batch_size=32
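A minimal sketch of the language prediction step, using SpeechBrain's public VoxLingua107 language-ID model; run-predict-lang may use a different model, batching, or thresholding, and the audio filename is only an illustration:

# predict the spoken language of one audio file with SpeechBrain (sketch)
from speechbrain.pretrained import EncoderClassifier

lang_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    savedir="lang-id-model",
)

# classify_file returns (posteriors, score, index, predicted labels)
prediction = lang_id.classify_file("example.wav")  # hypothetical file
print(prediction[3])  # predicted language label(s)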

Noisy Audiobook#

All the videos, songs, images, and graphics used in the video belong to their respective owners and I do not claim any rights over them.

Copyright Disclaimer under section 107 of the Copyright Act of 1976, allowance is made for "fair use" for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research. Fair use is a use permitted by copyright statute that might otherwise be infringing.