normalization#
IIUM Confession#
Pipeline: original text -> Google Translate to English (EN) -> Google Translate back to Malay (MS).
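The pipeline above is a round trip through English. A minimal sketch, assuming a caller-supplied `translate(text, source, target)` function; that name, its signature, and the language codes `ms` and `en` are hypothetical placeholders, not the API of any particular translation client and not the script used to build this dataset:

```python
from typing import Callable

# Hypothetical signature: translate(text, source_lang, target_lang) -> str.
# Plug in whichever translation client you actually use.
Translator = Callable[[str, str, str], str]


def round_trip_normalize(text: str, translate: Translator) -> str:
    """Translate the text to English, then translate the result back to Malay."""
    english = translate(text, "ms", "en")
    return translate(english, "en", "ms")
```

Passing the translator in as an argument keeps the sketch independent of any one service and makes it easy to test with a stub.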
download#
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! IIUM Confession abstractive normalization using Google Translate},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/normalization/iium-confession}}
}
Rumi-to-Jawi#
Originally from https://www.ejawi.net/converterV2.php?go=rumi
download#
Wikipedia, single words: https://f000.backblazeb2.com/file/malay-dataset/normalization/rumi-jawi/wikipedia-1word.json
Wikipedia, random windows of 2 to 6 words: https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wikipedia-windows.json
News, random windows of 2 to 6 words: https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/news-windows.json
Full news, random windows of 10 to 20 words, train set, JSONL format: https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/jawi-rumi-news-full.train
Full news, random windows of 10 to 20 words, test set, JSONL format: https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/jawi-rumi-news-full.test
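The full news train and test files are listed as JSONL, meaning one JSON value per line. A minimal sketch for reading such a file after downloading it; the record fields are not documented here, so the reader makes no assumption about key names, and the local path in the usage note is only an example:

```python
import json
from typing import Any, Iterator


def read_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `next(read_jsonl("jawi-rumi-news-full.train"))` returns the first record, which is a quick way to inspect the actual field names before writing any further processing.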
Wikipedia#
Sliding windows of random length, between 20 and 200 words each.
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-0.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-1.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-2.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-3.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-4.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-5.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-6.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-7.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-8.jsonl
https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wiki-rumi-jawi-9.jsonl
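The windowing described for these Wikipedia files can be sketched as follows. This is an illustration under stated assumptions, not the script used to build the dataset: the windows are taken to be consecutive and non-overlapping, the final window may be shorter than the minimum, and the function name and parameters are hypothetical.

```python
import random
from typing import List, Optional


def random_sliding_windows(
    words: List[str],
    min_len: int = 20,
    max_len: int = 200,
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Cut `words` into consecutive windows of random length in [min_len, max_len]."""
    rng = random.Random(seed)
    windows = []
    start = 0
    while start < len(words):
        size = rng.randint(min_len, max_len)  # inclusive on both ends
        # Assumption: the last window keeps whatever words remain,
        # so it may be shorter than min_len.
        windows.append(words[start:start + size])
        start += size
    return windows
```

Joining the windows back together reproduces the input word list, so no text is dropped or duplicated by the split.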
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Rumi-to-Jawi Dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/normalization/rumi-jawi}}
}
Stemming and Lemmatization#
download#
https://f000.backblazeb2.com/file/malay-dataset/wiki-stem.json
https://huggingface.co/datasets/mesolitica/stemming/resolve/main/train_stem.json
https://huggingface.co/datasets/mesolitica/stemming/resolve/main/test_stem.json
https://huggingface.co/datasets/mesolitica/stemming/resolve/main/train_noisy_stem.json
https://huggingface.co/datasets/mesolitica/stemming/resolve/main/test_noisy_stem.json
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Stemming and Lemmatization Dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/normalization/stemmer}}
}
Normalization Twitter#
Twitter text normalized using the lexicon-based normalizer from malaya.
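To illustrate what lexicon-based normalization means in general, here is a minimal sketch that replaces each token with its lexicon entry when one exists. It does not use or reproduce the malaya API; the function name and the three lexicon entries are hypothetical examples chosen for illustration.

```python
from typing import Dict

# Illustrative entries only (assumption); a real lexicon would be far larger.
EXAMPLE_LEXICON: Dict[str, str] = {
    "yg": "yang",
    "dgn": "dengan",
    "sy": "saya",
}


def normalize_with_lexicon(text: str, lexicon: Dict[str, str]) -> str:
    """Replace each whitespace-separated token that appears in the lexicon."""
    tokens = text.split()
    # Lookup is case-insensitive; tokens without an entry are kept unchanged.
    return " ".join(lexicon.get(token.lower(), token) for token in tokens)
```

A plain lookup like this handles only exact short forms; punctuation attached to a token, elongated spellings, and context-dependent words would need additional rules.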