corpus#
Audience#
Original website, https://www.kaggle.com/crowdflower/political-social-media-posts
Citation#
Auto generated using https://www.bibme.org/bibtex/website-citation,
@misc{eight_2016, title={Political Social Media Posts}, url={https://www.kaggle.com/crowdflower/political-social-media-posts}, journal={Kaggle}, author={{Figure Eight}}, year={2016}, month={Nov}}
Emotion#
Emotion dataset gathered using a lexicon; all steps are in the notebook.
download#
anger 108813
fear 20316
happy 30962
love 20783
sadness 26468
surprise 13107
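The label counts above are heavily skewed toward `anger`. A minimal sketch for computing each label's share of the dataset, using only the counts listed here (the variable names are illustrative, not part of the dataset):

```python
# Label counts copied from the list above.
emotion_counts = {
    "anger": 108813,
    "fear": 20316,
    "happy": 30962,
    "love": 20783,
    "sadness": 26468,
    "surprise": 13107,
}

total = sum(emotion_counts.values())

# Fraction of the whole dataset that each label accounts for.
emotion_shares = {label: count / total for label, count in emotion_counts.items()}

# Print labels from most to least frequent.
for label, share in sorted(emotion_shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<10}{share:.1%}")
```

Shares like these can help decide whether to resample or weight classes before training.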
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Semi-Supervised Emotion dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/emotion}}
}
Gender#
Original website, https://www.kaggle.com/crowdflower/twitter-user-gender-classification
Citation#
Auto generated using https://www.bibme.org/bibtex/website-citation,
@misc{eight_2016, title={Twitter User Gender Classification}, url={https://www.kaggle.com/crowdflower/twitter-user-gender-classification}, journal={Kaggle}, author={{Figure Eight}}, year={2016}, month={Nov}}
GoEmotions#
Original website, https://github.com/google-research/google-research/tree/master/goemotions
Download#
Citation#
@article{DBLP:journals/corr/abs-2005-00547,
author = {Dorottya Demszky and
Dana Movshovitz{-}Attias and
Jeongwoo Ko and
Alan S. Cowen and
Gaurav Nemade and
Sujith Ravi},
title = {GoEmotions: {A} Dataset of Fine-Grained Emotions},
journal = {CoRR},
volume = {abs/2005.00547},
year = {2020},
url = {https://arxiv.org/abs/2005.00547},
eprinttype = {arXiv},
eprint = {2005.00547},
timestamp = {Fri, 08 May 2020 15:04:04 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2005-00547.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Insincere Question#
Original website, https://www.kaggle.com/c/quora-insincere-questions-classification
Citation#
Auto generated using https://www.bibme.org/bibtex/website-citation,
@misc{kaggle, title={Quora Insincere Questions Classification}, url={https://www.kaggle.com/c/quora-insincere-questions-classification}, journal={Kaggle}}
Irony#
Original website, https://www.kaggle.com/rtatman/ironic-corpus
Citation#
Auto generated using https://www.bibme.org/bibtex/website-citation,
@misc{tatman_2017, title={Ironic Corpus}, url={https://www.kaggle.com/rtatman/ironic-corpus}, journal={Kaggle}, author={Tatman, Rachael}, year={2017}, month={Jul}}
Language Detection#
Language detection dataset gathered using a lexicon; all steps are in the notebook.
download#
Download dataset from here, https://huggingface.co/datasets/mesolitica/language-detection/resolve/main/train-test.json
Split 80% for training and 20% for testing.
Labels (label, train count, test count),
english, 2215975, 553739
malay, 7202654, 1800649
indonesia, 2295708, 576059
rojak, 757559, 189678
manglish, 726678, 181442
others, 5720022, 1428083
Download dataset from here, https://huggingface.co/datasets/mesolitica/language-detection/resolve/main/sublanguages.json
Labels,
malay 7179851
kedah 14071
johor 2172
melaka 7714
terengganu 4436
sarawak 6429
negeri-sembilan 7717
kelantan 2305
pahang 3647
perak 1307
sabah 1253
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Lexicon based Language Detection dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/language-detection}}
}
Malaysia Entities#
Social media texts related to Malaysian entities, gathered using a lexicon.
List#
Complete list (210 entities)
mahathir
anwar ibrahim
najib razak
pakatan harapan
syed saddiq
parti keadilan rakyat
umno
barisan nasional
parti islam semalaysia
nurul izzah
tunku ismail idris
mca
democratic action party
parti amanah
ppbm
mic
tun daim zainuddin
datuk seri abdul hadi awang
majlis pakatan harapan
wan azizah
parti pribumi bersatu malaysia
datuk seri azmin ali
datuk johari abdul
tengku razaleigh hamzah
tan sri dr rais yatim
rafizi ramli
bersatu
bernama
donald trump
perkasa
tan sri mokhzani mahathir
rais yatim
anthony loke siew fook
rosmah mansur
arul kanda
zeti aziz
robert kuok
hassan merican
ks jomo
jho low
kadir jasin
zakir naik
bung mokhtar
shafie apdal
ariff md yusof
felda
dato vida
jabatan perancangan bandar desa
jabatan perdana menteri malaysia
kementerian kewangan malaysia
kementerian dalam negeri malaysia
kementerian perdagangan dalam negeri hal ehwal pengguna malaysia
kementerian luar negeri malaysia
kementerian pertahanan malaysia
kementerian pendidikan malaysia
kementerian pembangunan luar bandar
kementerian kerja raya malaysia
kementerian kesihatan malaysia
kementerian komunikasi multimedia malaysia
kementerian perumahan kerajaan tempatan malaysia
kementerian pelancongan kebudayaan malaysia
kementerian pengangkutan malaysia
kementerian pembangunan wanita keluarga masyarakat malaysia
kementerian pertanian industri asas tani
kementerian perusahaan perladangan komoditi
kementerian perdagangan antarabangsa industri
kementerian sains teknologi inovasi malaysia
kementerian sumber manusia malaysia
kementerian sumber asli alam sekitar malaysia
kementerian wilayah persekutuan malaysia
kementerian tenaga teknologi hijau air malaysia
jabatan perkhidmatan awam malaysia
jabatan kemajuan islam (jakim) department of islamic development
jabatan parlimen malaysia
agensi kelayakan malaysia
agensi penguatkuasaan maritim malaysia
bahagian istiadat urusetia persidangan antarabangsa
bahagian hal ehwal undang-undang
bahagian kabinet perlembangan perhubungan antara kerajaan
bahagian kemajuan wilayah persekutuan perancangan lembah klang
bahagian keselamatan negara
bahagian pengurusan hartanah
bahagian pengurusan perkhidmatan sumber manusia
bahagian penyelidikan
biro bantuan guaman
biro pengaduan awam
biro tatanegara
istana negara
institut kefahaman islam malaysia
institut latihan kehakiman perundangan
pejabat ketua setiausaha negara
pejabat perdana menteri
jabatan peguam negara
majlis agama islam wilayah persekutuan
masjid negara
pejabat ketua pegawai keselamatan kerajaan malaysia
pejabat setiausaha persekutuan sabah
perpustakaan kuala lumpur
pejabat setiausaha persekutuan sarawak
lembaga tabung haji
penasihat sains
jabatan audit negara malaysia
jabatan pertahanan awam malaysia
suruhanjaya pengankutan awam darat
perbendaharaan malaysia
majlis tindakan ekonomik negara
jabatan perangkaan (jp) department of statistics
polis diraja malaysia
ikatan relawan rakyat malaysia
jabatan penjara malaysia
jabatan pendaftaran negara malaysia
lembaga penapisan filem
jabatan imigresen malaysia
suruhanjaya syarikat malaysia
suruhanjaya koperasi malaysia
perbadanan harta intelek malaysia
bank kerjasama rakyat malaysia
perbadanan nasional berhad
maktab koperasi malaysia
suruhanjaya persaingan malaysia
institut diplomasi hal ehwal luar negeri
angkatan tentera malaysia
tentera darat malaysia
tentera udara diraja malaysia
tentera laut diraja malaysia
program latihan khidmat negara
dewan bahasa pustaka
institut pendidikan guru malaysia
perbadanan tabung pendidikan tinggi nasional
institut terjemahan negara malaysia
kejora
felcra
risda
jabatan kerja raya malaysia
lembaga lebuhraya malaysia
lembaga jurutera malaysia
lembaga pembangunan industri pembinaan
institut jantung negara
klinik 1malaysia
insitut kanser negara
radio televisyen malaysia
suruhanjaya komunikasi multimedia malaysia
jabatan penerangan malaysia
jabatan perancangan bandar desa semenanjung malaysia
jabatan bomba penyelamat malaysia
jabatan perumahan negara
jabatan kerajaan tempatan
jabatan landskap negara
jabatan pengurusan sisa pepejal negara
tribunal perumahan pengurusan strata
perbadanan pengurusan sisa pepejal pembersihan awam
jabatan pelancongan malaysia
jabatan pengangkutan jalan
jabatan penerbangan awam
lembaga pelabuhan klang
jabatan laut malaysia
jabatan keselamatan jalan raya
lembaga pelabuhan kuantan
lembaga pelabuhan johor
lembaga pelabuhan pulau pinang
jabatan kebajikan masyarakat malaysia
institut penyelidikan kemajuan pertanian malaysia
lembaga kemajuan ikan malaysia
lembaga pemasaran pertanian persekutuan
jabatan pertanian malaysia
lembaga pertubuhan peladang
lembaga kemajuan pertanian kemubu
lembaga kemajuan pertanian muda
jabatan perikanan
jabatan perkhidmatan veterinar
lembaga perindustrian nanas malaysia
tabung ekonomi kumpulan usaha niaga
bank pertanian
lembaga minyak sawit malaysia
lembaga pembangunan pelaburan malaysia
agensi nuklear malaysia
institut penyelidikan teknologi nuklear malaysia
pusat sains negara
jabatan kimia malaysia
jabatan meteorologi malaysia
jabatan perkhidmatan awam
institut tadbiran awam negara
jabatan agama islam wilayah persekutuan
jabatan tenaga kerja semenanjung malaysia
jabatan alam sekitar
jabatan pengairan saliran
jabatan tanah galian wilayah persekutuan
jabatan perlindungan hidupan liar taman negara
dewan bandaraya kuala lumpur
perbadanan putrajaya
perbadanan labuan
jabatan bekalan air
jabatan perkhidmatan pembetungan
suruhanjaya tenaga
suruhanjaya perkhidmatan air negara
malaysian green technology corporation
yayasan hijau malaysia
mahkamah persekutuan
mahkamah syariah wilayah persekutuan
suruhanjaya perdagangan komoditi
suruhanjaya perkhidmatan awam
suruhanjaya perkhidmatan pendidikan
suruhanjaya pilihan raya
suruhanjaya pencegahan rasuah malaysia
tribunal perkhidmatan awam
unit khas teknologi tinggi
unit pemodenan tadbiran perancangan pengurusan malaysia
unit perancang ekonomi
unit penyelarasan pelaksanaan
urusetia persidangan antarabangsa protokol
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Lexicon based Malaysia Entities dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/malaysia-entities}}
}
Malaysia Topics#
Social media texts related to Malaysian topics, gathered using a lexicon.
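As a rough illustration of what lexicon-based matching can look like, the sketch below tags a text with every keyword that occurs in it as a whole phrase. This is an assumption about the general technique, not the exact procedure used to build this dataset; `compile_lexicon` and `match_topics` are hypothetical helper names and the sample texts are made up.

```python
import re


def compile_lexicon(keywords):
    """Compile one case-insensitive, whole-phrase pattern per keyword."""
    return {
        keyword: re.compile(
            r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE
        )
        for keyword in keywords
    }


def match_topics(text, patterns):
    """Return the keywords whose pattern occurs somewhere in the text."""
    return [keyword for keyword, pattern in patterns.items() if pattern.search(text)]


# A few keywords taken from the topic list below; the texts are invented examples.
patterns = compile_lexicon(["gaji minimum", "harga minyak", "gst"])
print(match_topics("Harga minyak naik lagi minggu ini", patterns))
print(match_topics("kerajaan umum gaji minimum baharu", patterns))
```

The lookarounds keep short keywords such as `gst` from matching inside longer words.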
List#
Complete list (249 topics)
ganja
orang asli
kaum cina
k-pop
kaum india
pos laju
hari raya aidilfitri
hari raya aidiladha
syarikat permulaan
isu tanah
kaum melayu
facebook
keluar parti
sabotaj parti
kotak undi
humanoid
kemalangan penumpang cedera
kemalangan maut
individu penjara
kes rogol
kes cabul
kes rompakan
kes ragut
cambridge analytica
kokain
bebas tahanan
sosial media
twitter
instagram
mati dipukul
pengedar dadah
kematian wabak
letupan bom
isu dadah
isu bmf
isu diesel
isu china
isu saudi arabia
unifi
piala thomas
fifa
bahasa pengaturcaraan
baling botol
perkahwinan kanak-kanak
produk berbahaya
musim durian
world cup
motogp
euro 2020
ask me a question
thai cave
racist
bola sepak
hockey
sepak takraw
reformasi
deepavali
chinese new year
lazada sells
shopee sells
e-sport
valve corporation
dota2
counter strike global-offensive
asean football organization
blackpink
kecurian kereta
kecurian motosikal
youtube rewind
pewdiepie
isu tiket
kuota haji
tsunami
kes lemas
kes buang bayi
kes pecah rumah
paedophilia
kes luar nikah
kes tangkap basah
kes bawah umur
pdrm
1mdb
gst
sst
tiga penjuru
pilihan raya umum
pilihan raya kecil
pusat daerah mangundi
masalah air
rumah mampu milik
pendidikan
sekolah
universiti
maktab rendah sains mara
kesihatan
hutang negara
ekonomi
sosial
menteri besar kedah
menteri besar perak
menteri besar perlis
menteri besar selangor
menteri besar johor
menteri besar kelantan
menteri besar terengganu
menteri besar negeri sembilan
felda
kwsp
sosco
bank malaysia
bank negara
perdana menteri
timbalan perdana menteri
menteri dalam negeri
menteri kewangan
menteri pertahanan
menteri belia dan sukan
majlis penasihat
skim peduli sihat
ptptn
projek mega
gaji minimum
menyiasat skandal
highway tol
tabung haji
tentera malaysia
infrastruktur
kos sara hidup
pengangkutan awam
perkhidmatan awam
isu wanita
survei institut darul ehsan
inisiatif peduli rakyat
teknologi
internet
kecerdasan buatan
ahli dewan undangan negeri
suruhanjaya pilihan raya malaysia
kertas undi
akta pilihan raya
undi pos
undi rosak
harga minyak
petrol
subsidi kerajaan
mh370
gaji menteri
jabatan bubar
telekom malaysia
agama
lgbt
agama islam
masyarakat
liberalisme
kapitalisme
idealogi
parlimen
pusat transformasi bandar
institut diraja
tsunami fitnah
makro-ekonomi
mikro-ekonomi
pasaran saham malaysia
pendapatan negara
nilai ringgit jatuh
gaji median
bursa malaysia
malaysia baru
keluar parlimen
dewan rakyat
tabung harapan
isu singapura
isu rohingya
isu syria
malaysia-indonesia
isu gaza
isu palestin
isu yaman
harimau malaya
isu kuil
isu lynas
isu masjid
isu sosma
isu ecrl
royalti minyak
kes rasuah
kewangan dan perniagaan
saham dan komoditi
isu kerugian
bumiputera
alam sekitar
isu kemiskinan
sumber asli
pertanian malaysia
pertanian durian
pertanian padi
pertanian getah
pertanian kelapa sawit
pertanian pisang
pertanian nenas
akuakultur malaysia
hortikultur malaysia
icerd
yang di-pertuan agong
perlembagaan malaysia
malaysia airlines
malaysia airport
kuala lumpur international airport
malacca airport
bintulu airport
kota kinabalu airport
kuching airport
labuan airport
lahad datu airport
langkawi airport
limbang airport
miri airport
penang airport
sandakan airport
sibu airport
sultan abdul halim airport
sultan haji ahmad shah airport
sultan azlan shah airport
sultan ismail petra airport
sultan mahmud airport
tawau airport
tioman airport
anggota bomba
angkatan tentera darat
angkatan tentera laut
angkatan tentera udara
anggota ambulans
anggota polis
perkhidmatan kehakiman
perkhidmatan am persekutuan
industri 4.0
kumpulan pengganas tempatan
kumpulan pengganas asing
sultan selangor
sultan kedah
sultan kelantan
sultan perlis
sultan johor
sultan negeri sembilan
sultan terengganu
pemilihan agong
isu plastik
gejala sosial
isytihar darurat
download#
Download dataset from here, https://huggingface.co/datasets/mesolitica/malaysian-twitter-by-topics/resolve/main/malaysia-topics.zip
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Lexicon based Malaysia Topics dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/malaysia-topics}}
}
Amazon Review Data#
Originally from https://nijianmo.github.io/amazon/
download#
Citation#
Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019
NSFW#
Gathered NSFW dataset using lexicon, all steps in notebook.
download#
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Lexicon based NSFW Detection dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/nsfw}}
}
The Pile#
The Pile, translated using the Malaya EN-MS model.
Original paper, https://arxiv.org/abs/2101.00027
Original website, https://pile.eleuther.ai/
download#
jsonl format, check download.txt.
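A minimal sketch for reading a JSONL file, where each non-empty line holds one JSON value. The `read_jsonl` helper and the `text` field in the sample are assumptions for illustration; the actual record schema is not documented here.

```python
import io
import json


def read_jsonl(lines):
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for an opened file. The "text" field is a hypothetical example,
# not a documented schema.
sample = io.StringIO('{"text": "ayat pertama"}\n\n{"text": "ayat kedua"}\n')
records = list(read_jsonl(sample))
print(len(records))
```

With a downloaded file, pass the object returned by `open(path, encoding="utf-8")` instead of the `io.StringIO` stand-in; iterating lazily avoids loading the whole file into memory.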
Citation#
@article{DBLP:journals/corr/abs-2101-00027,
author = {Leo Gao and
Stella Biderman and
Sid Black and
Laurence Golding and
Travis Hoppe and
Charles Foster and
Jason Phang and
Horace He and
Anish Thite and
Noa Nabeshima and
Shawn Presser and
Connor Leahy},
title = {The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
journal = {CoRR},
volume = {abs/2101.00027},
year = {2021},
url = {https://arxiv.org/abs/2101.00027},
archivePrefix = {arXiv},
eprint = {2101.00027},
timestamp = {Thu, 21 Jan 2021 14:42:30 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2101-00027.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Political Landscape#
Deprecated, will update soon.
Political landscape detection dataset gathered using a lexicon.
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Lexicon based Political Landscape Detection dataset},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/political-landscape}}
}
News Headlines Dataset For Sarcasm Detection#
Original website, https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection
Citation#
@misc{misra_2019, title={News Headlines Dataset For Sarcasm Detection}, url={https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection}, journal={Kaggle}, author={Misra, Rishabh}, year={2019}, month={Jul}}
Subjectivity#
Original website, http://www.cs.cornell.edu/people/pabo/movie-review-data/
@InProceedings{Pang+Lee:04a,
author = {Bo Pang and Lillian Lee},
title = {A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts},
booktitle = "Proceedings of the ACL",
year = 2004
}
Substring language detection#
Only these labels are available: ['MS', 'EN', 'OTHERS', 'CAPITAL', 'NOT_LANG'].
download#
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus! Substring language detection},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/corpus/substring-language-detection}}
}
Toxicity Large#
Original website, https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification
Added a few local toxicity keywords using a lexicon; all steps are in the notebook.
download#
url, https://f000.backblazeb2.com/file/malay-dataset/toxicity/
translated-0.json
translated-1000000.json
translated-1050000.json
translated-1100000.json
translated-1150000.json
translated-1200000.json
translated-1450000.json
translated-150000.json
translated-1500000.json
translated-1550000.json
translated-1600000.json
translated-1650000.json
translated-1700000.json
translated-1750000.json
translated-1800000.json
translated-250000.json
translated-300000.json
translated-350000.json
translated-400000.json
translated-450000.json
translated-50000.json
translated-500000.json
translated-550000.json
translated-600000.json
translated-650000.json
translated-700000.json
translated-750000.json
translated-850000.json
translated-900000.json
translated-950000.json
chinese, malay and indian labels from local tweets, https://f000.backblazeb2.com/file/malay-dataset/toxicity/kaum.json
Weak learning score using BERT Base for chinese, malay and indian labels, https://f000.backblazeb2.com/file/malay-dataset/toxicity/weak-learning-toxicity.json
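The `translated-*.json` files are listed in lexicographic order, so `translated-150000.json` appears before `translated-50000.json`. A minimal sketch for building the full download URLs from the base URL above, ordered by the number in the file name; it uses only a few of the listed names, and `numeric_suffix` is a hypothetical helper.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://f000.backblazeb2.com/file/malay-dataset/toxicity/"

# A subset of the file names listed above, in the order they are listed.
filenames = [
    "translated-0.json",
    "translated-1000000.json",
    "translated-150000.json",
    "translated-50000.json",
]


def numeric_suffix(filename: str) -> int:
    """Extract the integer embedded in a translated-<n>.json file name."""
    match = re.fullmatch(r"translated-(\d+)\.json", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1))


# Sort by the embedded number instead of by string comparison.
ordered = sorted(filenames, key=numeric_suffix)
urls = [urljoin(BASE_URL, name) for name in ordered]

for url in urls:
    print(url)
```

The numbering has gaps in the list above, so it is safer to build URLs from the listed names than to generate them from a numeric range.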
Citation#
@misc{kaggle, title={Jigsaw Multilingual Toxic Comment Classification}, url={https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification}, journal={Kaggle}}
Toxicity Small#
Original website, https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
download#
part 1, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic0.json
part 2, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic1.json
part 3, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic2.json
part 4, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic3.json
part 5, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic4.json
part 6, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic5.json
part 7, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic6.json
part 8, https://f000.backblazeb2.com/file/malay-dataset/toxicity-small/toxic7.json
Citation#
@misc{kaggle, title={Toxic Comment Classification Challenge}, url={https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge}, journal={Kaggle}}