Normalization

IIUM Confession

Pipeline: source text -> Google Translate to English (EN) -> Google Translate back to Malay (MS).
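The round trip described here can be sketched as below. This is only an illustration, not the script used to build the dataset: `translate` is a hypothetical callable taking `(text, target_language)` that you would back with whatever translation client you have access to, and the language codes `"en"` and `"ms"` are assumptions.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: (text, target_language) -> translated text.
# Supply your own implementation; no real translation API is called here.
Translator = Callable[[str, str], str]


def round_trip(text: str, translate: Translator) -> str:
    """Translate the text to English, then translate that result to Malay."""
    english = translate(text, "en")
    return translate(english, "ms")


def build_pairs(texts: Iterable[str], translate: Translator) -> List[Tuple[str, str]]:
    """Pair each original text with its round-trip (normalized) output."""
    return [(text, round_trip(text, translate)) for text in texts]
```

Passing the translator in as an argument keeps the sketch independent of any particular service and makes it easy to test with a stub.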

Citation

@misc{Malay-Dataset,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
note = {We gather Bahasa Malaysia corpus! IIUM Confession abstractive normalization using Google Translate},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/normalization/iium-confession}}
}

Rumi-to-Jawi

Originally from https://www.ejawi.net/converterV2.php?go=rumi

Download

  1. Wikipedia, single words, https://f000.backblazeb2.com/file/malay-dataset/normalization/rumi-jawi/wikipedia-1word.json

  2. Wikipedia, random windows of 2 to 6 words, https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/wikipedia-windows.json

  3. News, random windows of 2 to 6 words, https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/news-windows.json

  4. Full news, random windows of 10 to 20 words, train set, JSONL format, https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/jawi-rumi-news-full.train

  5. Full news, random windows of 10 to 20 words, test set, JSONL format, https://huggingface.co/datasets/mesolitica/rumi-jawi/resolve/main/jawi-rumi-news-full.test
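Once a file from the list above has been downloaded, it can be read with the standard library alone. The sketch below assumes the `.json` files each hold a single JSON document and the `.train` / `.test` files hold one JSON value per line (JSONL). The record layout is not documented here, so inspect the first few records before relying on any field names.

```python
import json
from typing import Any, Iterator


def read_json(path: str) -> Any:
    """Load a file that contains a single JSON document."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def read_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `next(read_jsonl("jawi-rumi-news-full.train"))` would return the first record of a locally saved train split, which is a cheap way to check its structure.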

Citation

@misc{Malay-Dataset,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
note = {We gather Bahasa Malaysia corpus! Rumi-to-Jawi Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/normalization/rumi-jawi}}
}

Stemming and Lemmatization

Citation

@misc{Malay-Dataset,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
note = {We gather Bahasa Malaysia corpus! Stemming and Lemmatization Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/normalization/stemmer}}
}

Normalization Twitter

Twitter text normalized using Malaya's lexicon-based normalizer.
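To illustrate the general idea of lexicon-based normalization, here is a minimal sketch that replaces informal tokens by dictionary lookup. It does not use or reproduce Malaya's API: the `LEXICON` entries and the `normalize` function are hypothetical, and the real normalizer may apply additional rules.

```python
from typing import Dict

# Illustrative entries only; these are NOT taken from Malaya's lexicon.
LEXICON: Dict[str, str] = {
    "x": "tidak",
    "yg": "yang",
    "sy": "saya",
}


def normalize(text: str, lexicon: Dict[str, str] = LEXICON) -> str:
    """Replace each whitespace-separated token that appears in the lexicon."""
    tokens = text.split()
    # Look tokens up case-insensitively; unknown tokens pass through unchanged.
    return " ".join(lexicon.get(token.lower(), token) for token in tokens)
```

With these sample entries, `normalize("sy x tahu yg mana")` returns `"saya tidak tahu yang mana"`.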