translation#

FLORES-200 Evaluation set#

IIUM-Confession#

Translated using Translateer, https://github.com/Songkeys/Translateer

Citation#

@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Google Translate IIUM Confession,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/translation/iium-confession}}
}

LASER for eng_Latn-zsm_Latn#

Original dataset at https://github.com/facebookresearch/LASER/tree/main/data/nllb200#data

Update: AllenNLP released the NLLB dataset at https://huggingface.co/datasets/allenai/nllb; the eng_Latn-zsm_Latn pair is available at https://storage.googleapis.com/allennlp-data-bucket/nllb/eng_Latn-zsm_Latn.gz

how-to#

  1. Install dependencies,

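# libcurl development headers
sudo apt-get install libcurl4-openssl-dev
# Fetch preprocess and switch to the wet branch (wet_lines is used in the Run LASER step)
git clone https://github.com/kpuatfb/preprocess.git
cd preprocess
git checkout wet
mkdir build
cd build
# Build xxHash inside the preprocess build directory
git clone https://github.com/Cyan4973/xxHash
mkdir xxHash/build
cd xxHash/build
cmake ../cmake_unofficial
cmake --build .
cd ../..
# Build preprocess itself
cmake ..
make -j4
# Install the LASER utilities (cleaner_splitter.py is used in the Run LASER step)
git clone https://github.com/facebookresearch/LASER.git
cd LASER/utils
pip3 install -e . --user
cd ../..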
  2. Download the dataset from https://github.com/facebookresearch/LASER/tree/main/data/nllb200#data. I chose eng_Latn-zsm_Latn.

  3. Run LASER,

xzcat eng_Latn-zsm_Latn.meta.v1.xz | egrep ^crawl-data | ~/preprocess/build/bin/wet_lines | python3 ~/preprocess/build/LASER/utils/src/cleaner_splitter.py > eng_Latn-zsm_Latn
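
For readers who prefer to do the first two stages of the pipeline in Python, here is a minimal sketch of the `xzcat ... | egrep ^crawl-data` part only. The function name `iter_crawl_data_lines` is made up for illustration; `wet_lines` and `cleaner_splitter.py` are not reimplemented here.

```python
import lzma
from typing import Iterator


def iter_crawl_data_lines(path: str) -> Iterator[str]:
    """Yield lines that start with 'crawl-data', like `egrep ^crawl-data`."""
    # Text mode decompresses the .xz file lazily, one line at a time.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("crawl-data"):
                # Drop only the trailing newline; tabs inside the line are kept.
                yield line.rstrip("\n")
```

Usage would look like `for line in iter_crawl_data_lines("eng_Latn-zsm_Latn.meta.v1.xz"): print(line)`, with the printed lines piped to `wet_lines` as in the command above.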

how-to distribute#

Requires Redis.

  1. Filter the metadata,

xzcat eng_Latn-zsm_Latn.meta.v1.xz | egrep ^crawl-data > eng_Latn-zsm_Latn.meta
mkdir splitted
cd splitted
split -l 1000000 -d --additional-suffix=.split ../eng_Latn-zsm_Latn.meta eng_Latn-zsm_Latn.meta
  2. Group by sha1 and paragraphs using gather-warc.ipynb.

  3. Split the JSONL,

mkdir splitted-jsonl
cd splitted-jsonl
split -l 200000 -d --additional-suffix=.split ../warcs-eng_Latn-zsm_Latn.jsonl warcs-eng_Latn-zsm_Latn.jsonl
  4. Run distribute-laser-nllb200.ipynb for each split file.
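
The two `split` commands above cut a file into fixed-size line chunks. A rough Python equivalent is sketched below for environments without GNU `split`. The function name `split_lines` and its parameters are hypothetical, and the two-digit suffix is a simplification of what `split -d --additional-suffix=.split` produces, so it only matches for fewer than 100 chunks.

```python
from itertools import islice
from pathlib import Path
from typing import List


def split_lines(source: str, out_dir: str, lines_per_file: int, prefix: str) -> List[Path]:
    """Write `source` into numbered chunks of at most `lines_per_file` lines."""
    if lines_per_file <= 0:
        raise ValueError("lines_per_file must be positive")
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with open(source, "r", encoding="utf-8") as handle:
        index = 0
        while True:
            # Read the next chunk; an empty list means end of file.
            chunk = list(islice(handle, lines_per_file))
            if not chunk:
                break
            # Two-digit numeric suffix plus ".split", e.g. prefix00.split
            target = out_path / f"{prefix}{index:02d}.split"
            with open(target, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            written.append(target)
            index += 1
    return written
```

For example, `split_lines("eng_Latn-zsm_Latn.meta", "splitted", 1000000, "eng_Latn-zsm_Latn.meta")` mirrors the first `split` call. Each chunk is held in memory while it is written, which is acceptable for a sketch but worth streaming for very large chunk sizes.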

Citation#

@article{DBLP:journals/corr/abs-1812-10464,
author    = {Mikel Artetxe and
Holger Schwenk},
title     = {Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual
Transfer and Beyond},
journal   = {CoRR},
volume    = {abs/1812.10464},
year      = {2018},
url       = {http://arxiv.org/abs/1812.10464},
eprinttype = {arXiv},
eprint    = {1812.10464},
timestamp = {Wed, 02 Jan 2019 14:40:18 +0100},
biburl    = {https://dblp.org/rec/journals/corr/abs-1812-10464.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

Local EN to MS Subtitles#

Citation#

@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Local EN to MS Subtitles,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/translation/local-movies-subtitle}}
}

Google Translate EN to MS for longer texts#

how-to#

All files below share the URL prefix https://f000.backblazeb2.com/file/malay-dataset/translation/long-text/

  1. long-text-0.json.translated.json

  2. long-text-100000.json.translated.json

  3. long-text-1000000.json.translated.json

  4. long-text-200000.json.translated.json

  5. long-text-300000.json.translated.json

  6. long-text-400000.json.translated.json

  7. long-text-500000.json.translated.json

  8. long-text-600000.json.translated.json

  9. long-text-700000.json.translated.json

  10. long-text-800000.json.translated.json

  11. long-text-900000.json.translated.json

  12. long-text-1100000.json.translated.json

  13. long-text-1200000.json.translated.json

  14. long-text-1300000.json.translated.json

  15. long-text-1400000.json.translated.json

  16. long-text-1500000.json.translated.json

  17. long-text-1600000.json.translated.json
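
The full download URLs can be built by joining the prefix with each file name. The sketch below does this from the offsets that appear in the list above (0 to 1600000 in steps of 100000); the names `PREFIX`, `OFFSETS` and `long_text_urls` are only illustrative, and nothing is assumed about the JSON layout inside the files.

```python
from typing import List

PREFIX = "https://f000.backblazeb2.com/file/malay-dataset/translation/long-text/"

# Offsets taken from the listed file names: 0, 100000, ..., 1600000 (17 files).
OFFSETS = range(0, 1_700_000, 100_000)


def long_text_urls(prefix: str = PREFIX) -> List[str]:
    """Return one URL per listed file, in numeric offset order."""
    return [f"{prefix}long-text-{offset}.json.translated.json" for offset in OFFSETS]
```

The returned URLs can then be passed to any downloader, for example `wget -i` on a file containing one URL per line.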

Alignment#

Prepare alignment for MS-EN using eflomal.

how-to#

  1. Build https://github.com/robertostling/eflomal,

make
make INSTALLDIR=~/.local/bin install
python3 setup.py install --user
  2. Align the MS-EN text using prepare-ms-en-fwd-rev.ipynb.
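
As a rough illustration of the data handling around an aligner, the sketch below formats sentence pairs as `source ||| target` lines and parses `i-j` alignment links. This assumes the fast_align-style input and Pharaoh-style output that eflomal is understood to support; check the eflomal README before relying on it. It is not taken from prepare-ms-en-fwd-rev.ipynb, and both function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def to_fast_align_lines(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Format (source, target) pairs as 'source ||| target', skipping empty sides."""
    lines = []
    for source, target in pairs:
        # Collapse repeated whitespace so tokens are separated by single spaces.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if source and target:
            lines.append(f"{source} ||| {target}")
    return lines


def parse_links(line: str) -> List[Tuple[int, int]]:
    """Parse one line of 'i-j' links into (source_index, target_index) tuples."""
    links = []
    for token in line.split():
        src_index, tgt_index = token.split("-")
        links.append((int(src_index), int(tgt_index)))
    return links
```

Running the aligner once in each direction gives forward and reverse link files, and `parse_links` can read either one line by line.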

NLLB EN-MS#

Original page, https://github.com/facebookresearch/LASER/tree/main/data/nllb200

Applies a filter to the NLLB eng_Latn-zsm_Latn pair dataset.

Citation#

@misc{https://doi.org/10.48550/arxiv.2207.04672,
doi = {10.48550/ARXIV.2207.04672},
url = {https://arxiv.org/abs/2207.04672},
author = {{NLLB Team} and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti, Prangthip and Hoffman, John and Jarrett, Semarley and Sadagopan, Kaushik Ram and Rowe, Dirk and Spruit, Shannon and Tran, Chau and Andrews, Pierre and Ayan, Necip Fazil and Bhosale, Shruti and Edunov, Sergey and Fan, Angela and Gao, Cynthia and Goswami, Vedanuj and Guzmán, Francisco and Koehn, Philipp and Mourachko, Alexandre and Ropers, Christophe and Saleem, Safiyyah and Schwenk, Holger and Wang, Jeff},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7, 68T50},
title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution Share Alike 4.0 International}
}

Noisy EN-MS augmentation#

Augment EN-MS dataset.

Noisy MS-EN augmentation#

Augment MS-EN dataset.

OPUS#

Citation#

@InProceedings{TIEDEMANN12.463,
author = {Jörg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}