translation#
ChatGPT 3.5 Noisy Translation b.cari.com.my#
ChatGPT 3.5 Noisy Translation c.cari.com.my#
ChatGPT 3.5 Noisy Translation Facebook#
ChatGPT 3.5 Noisy Translation IIUM Confession#
ChatGPT 3.5 Noisy Translation Manglish#
ChatGPT 3.5 NLLB Banjarese#
ChatGPT 3.5 Noisy Translation Twitter#
ChatGPT 4 Noisy Translation Twitter to local dialect#
Alignment#
Prepare alignment for EN-MS using eflomal.
Download#
EN, https://f000.backblazeb2.com/file/malay-dataset/translation/en-ms-alignment/EN
MS, https://f000.backblazeb2.com/file/malay-dataset/translation/en-ms-alignment/MS
fwd, https://f000.backblazeb2.com/file/malay-dataset/translation/en-ms-alignment/fwd
rev, https://f000.backblazeb2.com/file/malay-dataset/translation/en-ms-alignment/rev
align.priors, https://f000.backblazeb2.com/file/malay-dataset/translation/en-ms-alignment/align.priors
how-to#
make
make INSTALLDIR=~/.local/bin install
python3 setup.py install --user
Align EN-MS text,
python3 test.py
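The fwd and rev files can be combined into a single symmetric alignment. The following is a minimal sketch, assuming both files are in Pharaoh format (one line per sentence pair, links written as `source_index-target_index`) and that both use the same source-target orientation. The function names and the toy sentences are illustrative only and do not come from this repository.

```python
def parse_pharaoh(line):
    """Parse one Pharaoh-format line such as '0-0 1-2' into a set of (src, tgt) index pairs."""
    links = set()
    for token in line.split():
        src, tgt = token.split("-")
        links.add((int(src), int(tgt)))
    return links


def intersect_alignments(fwd_line, rev_line):
    """Keep only the links found in both directions (a high-precision symmetrization)."""
    return sorted(parse_pharaoh(fwd_line) & parse_pharaoh(rev_line))


def aligned_words(src_sentence, tgt_sentence, links):
    """Map index links back to whitespace-separated tokens."""
    src_tokens = src_sentence.split()
    tgt_tokens = tgt_sentence.split()
    return [(src_tokens[i], tgt_tokens[j]) for i, j in links]


if __name__ == "__main__":
    # Toy example, not taken from the dataset.
    en = "i eat rice"
    ms = "saya makan nasi"
    fwd = "0-0 1-1 2-2 2-1"
    rev = "0-0 1-1 2-2"
    links = intersect_alignments(fwd, rev)
    print(aligned_words(en, ms, links))
```

Intersection discards links that only one direction proposes, so it favours precision over recall. Other symmetrization heuristics would keep more links.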
FLORES-200 Evaluation set#
Google Translate EN to MS#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-english-news/tree/main
Facebook#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-malaysian-facebook
Google Translate filtered Common Crawl#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-filtered-common-crawl
Google Translate Malaysia Parliament#
Translate using https://github.com/Songkeys/Translateer
download#
Google Translate IIUM Confession#
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-IIUM-confession/tree/main
Google Translate Malay News#
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-malay-news/tree/main
Google Translate Malaysian PDF#
Google Translate MS-ID#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-ms-id
Google Translate MS-JW#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-ms-jw
Google Translate MS-PA#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-ms-pa
Google Translate MS-TA#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-ms-ta
Twitter#
Translate using https://github.com/Songkeys/Translateer
download#
Full list at https://huggingface.co/datasets/mesolitica/google-translate-malaysian-twitter/tree/main
IIUM-Confession#
Translate using https://github.com/Songkeys/Translateer
download#
Citation#
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Google Translate IIUM Confession,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/translation/iium-confession}}
}
LASER for eng_Latn-zsm_Latn#
Original dataset at https://github.com/facebookresearch/LASER/tree/main/data/nllb200#data
Update, AllenAI released the NLLB dataset at https://huggingface.co/datasets/allenai/nllb, and the eng_Latn-zsm_Latn pair is available at https://storage.googleapis.com/allennlp-data-bucket/nllb/eng_Latn-zsm_Latn.gz
how-to#
Install dependencies,
sudo apt-get install libcurl4-openssl-dev
git clone https://github.com/kpuatfb/preprocess.git
cd preprocess
git checkout wet
mkdir build
cd build
git clone https://github.com/Cyan4973/xxHash
mkdir xxHash/build
cd xxHash/build
cmake ../cmake_unofficial
cmake --build .
cd ../..
cmake ..
make -j4
git clone https://github.com/facebookresearch/LASER.git
cd LASER/utils
pip3 install -e . --user
cd ../..
Download the dataset from https://github.com/facebookresearch/LASER/tree/main/data/nllb200#data, I chose eng_Latn-zsm_Latn.
Run LASER,
xzcat eng_Latn-zsm_Latn.meta.v1.xz | egrep ^crawl-data | ~/preprocess/build/bin/wet_lines | python3 ~/preprocess/build/LASER/utils/src/cleaner_splitter.py > eng_Latn-zsm_Latn
how-to distribute#
Requires Redis.
Filter metadata,
xzcat eng_Latn-zsm_Latn.meta.v1.xz | egrep ^crawl-data > eng_Latn-zsm_Latn.meta
mkdir splitted
cd splitted
split -l 1000000 -d --additional-suffix=.split ../eng_Latn-zsm_Latn.meta eng_Latn-zsm_Latn.meta
Group by SHA-1 and paragraphs, gather-warc.ipynb.
Split JSONL,
mkdir splitted-jsonl
cd splitted-jsonl
split -l 200000 -d --additional-suffix=.split ../warcs-eng_Latn-zsm_Latn.jsonl warcs-eng_Latn-zsm_Latn.jsonl
Run distribute-laser-nllb200.ipynb for each split file.
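For environments without `egrep` and GNU `split`, the filter and split steps above can be approximated in Python. This is a minimal sketch: the helper names are hypothetical, and the chunk file names only mimic `split -d --additional-suffix=.split` for the first hundred chunks.

```python
import itertools
from pathlib import Path


def keep_crawl_data(lines):
    """Yield only lines that start with 'crawl-data', like `egrep ^crawl-data`."""
    for line in lines:
        if line.startswith("crawl-data"):
            yield line


def split_lines(lines, out_dir, prefix, lines_per_file):
    """Write `lines` into numbered chunk files of at most `lines_per_file` lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    iterator = iter(lines)
    paths = []
    for index in itertools.count():
        chunk = list(itertools.islice(iterator, lines_per_file))
        if not chunk:
            break
        # Two-digit numeric suffix followed by '.split', e.g. prefix00.split
        path = out_dir / f"{prefix}{index:02d}.split"
        path.write_text("".join(chunk), encoding="utf-8")
        paths.append(path)
    return paths
```

The compressed metadata can be streamed into `keep_crawl_data` with `lzma.open("eng_Latn-zsm_Latn.meta.v1.xz", "rt", encoding="utf-8")`, and the result passed to `split_lines` with `lines_per_file=1000000` to match the command above.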
download#
Filtered on LASER score >= 1.07, prepare-eng_Latn-zsm_Latn.ipynb.
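Assuming the threshold above means that pairs scoring at least 1.07 are kept, the filter can be sketched as follows. The record layout (`en`, `ms`, `score` keys) is hypothetical and is not the actual schema used in prepare-eng_Latn-zsm_Latn.ipynb.

```python
LASER_THRESHOLD = 1.07  # threshold quoted in the text


def filter_pairs(records, threshold=LASER_THRESHOLD, score_key="score"):
    """Keep records whose LASER score is at least `threshold`.

    `records` is an iterable of dicts; the key names are assumptions for illustration.
    """
    return [record for record in records if float(record[score_key]) >= threshold]


if __name__ == "__main__":
    # Toy records, not taken from the dataset.
    sample = [
        {"en": "good morning", "ms": "selamat pagi", "score": 1.20},
        {"en": "thank you", "ms": "terima kasih", "score": 1.07},
        {"en": "hello", "ms": "kereta", "score": 1.01},
    ]
    for record in filter_pairs(sample):
        print(record["en"], "->", record["ms"])
```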
Citation#
@article{DBLP:journals/corr/abs-1812-10464,
author = {Mikel Artetxe and
Holger Schwenk},
title = {Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual
Transfer and Beyond},
journal = {CoRR},
volume = {abs/1812.10464},
year = {2018},
url = {http://arxiv.org/abs/1812.10464},
eprinttype = {arXiv},
eprint = {1812.10464},
timestamp = {Wed, 02 Jan 2019 14:40:18 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1812-10464.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Local EN to MS Subtitles#
Citation#
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Local EN to MS Subtitles,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/translation/local-movies-subtitle}}
}
Google Translate EN to MS for longer texts#
how-to#
URL prefix for the files below, https://f000.backblazeb2.com/file/malay-dataset/translation/long-text/
long-text-0.json.translated.json
long-text-100000.json.translated.json
long-text-1000000.json.translated.json
long-text-200000.json.translated.json
long-text-300000.json.translated.json
long-text-400000.json.translated.json
long-text-500000.json.translated.json
long-text-600000.json.translated.json
long-text-700000.json.translated.json
long-text-800000.json.translated.json
long-text-900000.json.translated.json
long-text-1100000.json.translated.json
long-text-1200000.json.translated.json
long-text-1300000.json.translated.json
long-text-1400000.json.translated.json
long-text-1500000.json.translated.json
long-text-1600000.json.translated.json
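The file names above follow a regular pattern (offsets from 0 to 1600000 in steps of 100000), so the full download URLs can be generated instead of typed out. This is a minimal sketch; the helper names are hypothetical and the download function is not called here because it needs network access.

```python
import urllib.request
from pathlib import Path

PREFIX = "https://f000.backblazeb2.com/file/malay-dataset/translation/long-text/"


def long_text_urls(start=0, stop=1_600_000, step=100_000):
    """Build the URL for every long-text file from `start` to `stop`, inclusive."""
    return [
        f"{PREFIX}long-text-{offset}.json.translated.json"
        for offset in range(start, stop + step, step)
    ]


def download_all(urls, out_dir="long-text"):
    """Download each URL into `out_dir`, skipping files that already exist."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = out_dir / url.rsplit("/", 1)[-1]
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))


if __name__ == "__main__":
    for url in long_text_urls():
        print(url)
```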
Alignment#
Prepare alignment for MS-EN using eflomal.
Download#
how-to#
make
make INSTALLDIR=~/.local/bin install
python3 setup.py install --user
Align MS-EN text, prepare-ms-en-fwd-rev.ipynb.
NLLB EN-MS#
Original page, https://github.com/facebookresearch/LASER/tree/main/data/nllb200
Apply a filter to the NLLB eng_Latn-zsm_Latn pair dataset.
download#
Citation#
@misc{https://doi.org/10.48550/arxiv.2207.04672,
doi = {10.48550/ARXIV.2207.04672},
url = {https://arxiv.org/abs/2207.04672},
author = {{NLLB Team} and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti, Prangthip and Hoffman, John and Jarrett, Semarley and Sadagopan, Kaushik Ram and Rowe, Dirk and Spruit, Shannon and Tran, Chau and Andrews, Pierre and Ayan, Necip Fazil and Bhosale, Shruti and Edunov, Sergey and Fan, Angela and Gao, Cynthia and Goswami, Vedanuj and Guzmán, Francisco and Koehn, Philipp and Mourachko, Alexandre and Ropers, Christophe and Saleem, Safiyyah and Schwenk, Holger and Wang, Jeff},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7, 68T50},
title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution Share Alike 4.0 International}
}
Noisy EN-MS augmentation#
Augment EN-MS dataset.
download#
https://huggingface.co/datasets/mesolitica/noisy-en-ms-augmentation/resolve/main/augmented-en-ms-v2-part3.json
https://huggingface.co/datasets/mesolitica/noisy-en-ms-augmentation/resolve/main/augmented-en-ms-v3.json
Noisy MS-EN augmentation#
Augment MS-EN dataset.
download#
OPUS#
download#
gnome, https://f000.backblazeb2.com/file/malay-dataset/translation/opus/gnome-ms-en.json
kde4, https://f000.backblazeb2.com/file/malay-dataset/translation/opus/kde4-ms-en.json
opensubtitle, https://f000.backblazeb2.com/file/malay-dataset/translation/opus/opensubtitle-ms-en.json
qed, https://f000.backblazeb2.com/file/malay-dataset/translation/opus/qed-ms-en.json
tanzil, https://f000.backblazeb2.com/file/malay-dataset/translation/opus/tanzil-ms-en.json
ubuntu, https://f000.backblazeb2.com/file/malay-dataset/translation/opus/ubuntu-ms-en.json
Citation#
@InProceedings{TIEDEMANN12.463,
author = {Jörg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}