question-answer#
Common Crawl QA#
Generated using ChatGPT 3.5.
Extractive News QA#
Generated using ChatGPT 3.5.
Hansard QA#
Generated using ChatGPT 3.5.
Synthetic QA Choice#
Generated using ChatGPT 3.5 from the following sources,
https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset/resolve/main/dewanbahasa-jdbp.jsonl
https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset/resolve/main/majalahsains.jsonl
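Both source files above use the `.jsonl` extension, so they can presumably be read one JSON value per line. The following is a minimal sketch under that assumption; the helper name `iter_jsonl` is hypothetical, and the shape of each record (string or object) is not documented here, so the function returns whatever each line decodes to.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[object]:
    """Yield one parsed JSON value per non-empty line.

    Assumption: the file follows the JSON Lines convention,
    i.e. every non-blank line is a complete JSON value.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Usage, after downloading one of the files locally:
#
# with open("majalahsains.jsonl", encoding="utf-8") as f:
#     for record in iter_jsonl(f):
#         ...  # inspect the record before relying on any field name
```

Reading lazily keeps memory use flat for large files, since only one line is decoded at a time.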
download#
Synthetic Malaysian QA#
Generated common QA using ChatGPT3 for the following Malaysian agencies and institutions,
Agrobank
Bank Negara Malaysia
Bank Perusahaan Kecil dan Sederhana Malaysia
Bank Rakyat
Bank Simpanan Nasional
Bursa Malaysia
Dewan Bahasa dan Pustaka
Institut Kesihatan Umum
Institut Penyelidikan Perubatan
Institut Penyelidikan Sains dan Teknologi Pertahanan
Institut Penyelidikan Tingkahlaku Kesihatan
Institut Penyelidikan dan Kemajuan Pertanian Malaysia
Jabatan Akauntan Negara
Jabatan Bomba dan Penyelamat Malaysia
Jabatan Hal Ehwal Kesatuan Sekerja
Jabatan Hal Ehwal Veteran
Jabatan Imigresen Malaysia
Jabatan Kastam Diraja Malaysia
Jabatan Kebajikan Masyarakat
Jabatan Kemajuan Orang Asli
Jabatan Kerajaan Tempatan
Jabatan Kerja Raya
Jabatan Keselamatan Jalan Raya
Jabatan Keselamatan dan Keselamatan Pekerjaan
Jabatan Ketua Hakim Peguam
Jabatan Landskap Negara
Jabatan Latihan Khidmat Negara
Jabatan Laut Malaysia
Jabatan Pembangunan Wanita
Jabatan Pendaftaran Pertubuhan Malaysia
Jabatan Penerangan Malaysia
Jabatan Pengangkutan Jalan
Jabatan Pengurusan Sisa Pepejal Negara
Jabatan Penilaian dan Perkhidmatan Negara
Jabatan Penjara Malaysia
Jabatan Perancangan Bandar dan Desa
Jabatan Perdana Menteri Malaysia
Jabatan Perhubungan Perusahaan
Jabatan Perikanan Malaysia
Jabatan Perkhidmatan Kuarantin dan Pemeriksaan Malaysia
Jabatan Perkhidmatan Veterinar
Jabatan Pertanian Malaysia
Jabatan Perumahan Negara
Jabatan Perumahan dan Pengurusan Strata
Jabatan Sukarelawan Malaysia
Jabatan Tenaga Kerja
Jabatan Tenaga Kerja Manusia
Khazanah Nasional
Kolej Pertanian
Kumpulan Wang Persaraan
Kumpulan Wang Simpanan Pekerja
Lembaga Hasil Dalam Negeri Malaysia
Lembaga Kemajuan Ikan Malaysia
Lembaga Kemajuan Pertanian Kemubu
Lembaga Kemajuan Pertanian Muda
Lembaga Pelabuhan Bintulu
Lembaga Pelabuhan Johor
Lembaga Pelabuhan Klang
Lembaga Pelabuhan Kuantan
Lembaga Pemasaran Pertanian Persekutuan
Lembaga Pembangunan Pelaburan Malaysia
Lembaga Pembiayaan Perumahan Sektor Awam
Lembaga Penapisan Filem
Lembaga Penduduk dan Pembangunan Keluarga Negara
Lembaga Peperiksaan Malaysia
Lembaga Perindustrian Nanas Malaysia
Lembaga Perkhidmatan Kewangan Labuan
Lembaga Pertubuhan Peladang
Lembaga Promosi Kesihatan Malaysia
Lembaga Totalisator Malaysia
Pusat Pergigian Kanak-Kanak & Kolej Latihan Pergigian Malaysia
Wikipedia QA#
Generated using ChatGPT 3.5.
Synthetic CommonSense#
Generated using ChatGPT4; the source data is originally from https://huggingface.co/datasets/commonsense_qa
Synthetic Malaysian QA#
Generated common QA using ChatGPT4 for the following topics,
politics
socioeconomy
culture
gender
religion
sociology
social class
technology
ethnicity
infrastructure
health
education
ecology
party politics
diplomacy
history
cuisine
microeconomics
business
artificial intelligence
law
negeri johor
negeri kedah
negeri kelantan
negeri melaka
negeri negeri sembilan
negeri pahang
negeri perak
negeri perlis
negeri pulau pinang
negeri selangor
negeri terengganu
negeri sabah
negeri sarawak
kuala lumpur
negeri labuan
putrajaya
najib razak
anwar ibrahim
parti keadilan rakyat
parti islam semalaysia
dr mahathir mohamad
barisan nasional
constitutional monarchy
parliamentary democracy
political economy
political dynamic
empowerment of youths
kebebasan bersuara
sastera
tatabahasa
kesusasteraan melayu
pantun
sajak
syair
hadis
hukum aqidah islam
hukum fiqah islam
download#
Notes to myself#
Filter short questions.
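The note above could be implemented along these lines. This is only a sketch: the function name `filter_short_questions`, the `(question, answer)` tuple layout, and the `min_words` threshold of 3 are all assumptions for illustration, not values taken from the datasets.

```python
from typing import Iterable, List, Tuple


def filter_short_questions(
    pairs: Iterable[Tuple[str, str]],
    min_words: int = 3,  # hypothetical threshold, tune per dataset
) -> List[Tuple[str, str]]:
    """Keep only pairs whose question has at least `min_words` words.

    Words are counted by splitting on whitespace, which is a rough
    but language-agnostic measure of question length.
    """
    return [(q, a) for q, a in pairs if len(q.split()) >= min_words]
```

Counting whitespace-separated words rather than characters avoids penalising questions that contain long proper nouns.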
Natural Questions#
Original paper, https://research.google/pubs/pub47761/
download#
The data is structured like this,
Question <> Answer
Download the train set here, https://f000.backblazeb2.com/file/malay-dataset/qa/natural/translated-train.json
Download the validation set here, https://f000.backblazeb2.com/file/malay-dataset/qa/natural/translated-validation.json
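If the `Question <> Answer` layout means that each record is a single string with the question and answer separated by `<>`, a pair can be recovered as sketched below. That reading is an assumption, and `split_pair` is a hypothetical helper; inspect a few records from the downloaded JSON before relying on it.

```python
from typing import Tuple


def split_pair(text: str, sep: str = "<>") -> Tuple[str, str]:
    """Split a 'Question <> Answer' string into (question, answer).

    Assumption: the separator appears once between question and
    answer; only the first occurrence is used as the split point.
    """
    question, found, answer = text.partition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found in record")
    return question.strip(), answer.strip()
```

Raising on a missing separator surfaces malformed records early instead of silently producing an empty answer.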
Citation#
@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association for Computational Linguistics}
}
SQUAD#
Thanks to The Translate-Align-Retrieve (TAR) method for synthetic QA corpora generation, https://github.com/ccasimiro88/TranslateAlignRetrieve, for the steps to translate the SQuAD dataset.
Original website, https://rajpurkar.github.io/SQuAD-explorer/
Original paper, https://arxiv.org/abs/1806.03822
Steps to reproduce the translation are in the notebook.
download#
ms-train-1.1.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-train-1.1.json
ms-dev-1.1.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-dev-1.1.json
ms-train-2.0.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-train-2.0.json
ms-dev-2.0.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-dev-2.0.json
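Assuming the translated files keep the nested layout of the original SQuAD JSON (`data` → `paragraphs` → `qas`), they can be flattened into one row per question as sketched below. The helper name `iter_squad` is hypothetical, and the assumption about the layout should be checked against the downloaded files.

```python
from typing import Any, Dict, Iterator


def iter_squad(dataset: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per question from a SQuAD-style dict.

    Assumption: the file mirrors the original SQuAD structure.
    `is_impossible` only exists in v2.0, so it defaults to False.
    """
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "id": qa.get("id"),
                    "question": qa["question"],
                    "context": context,
                    "answers": [a["text"] for a in qa.get("answers", [])],
                    "is_impossible": qa.get("is_impossible", False),
                }


# Usage, after downloading a file locally:
#
# import json
# with open("ms-dev-2.0.json", encoding="utf-8") as f:
#     rows = list(iter_squad(json.load(f)))
```

Using `.get` for the v2.0-only field lets the same function read both the 1.1 and 2.0 files.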
Citation#
@article{DBLP:journals/corr/abs-1806-03822,
author = {Pranav Rajpurkar and
Robin Jia and
Percy Liang},
title = {Know What You Don't Know: Unanswerable Questions for SQuAD},
journal = {CoRR},
volume = {abs/1806.03822},
year = {2018},
url = {http://arxiv.org/abs/1806.03822},
archivePrefix = {arXiv},
eprint = {1806.03822},
timestamp = {Mon, 13 Aug 2018 16:48:21 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1806-03822.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
how-to#
v1.1#
train part1, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-0-100.json
train part2, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-100-200.json
train part3, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-200-300.json
train part4, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-300-400.json
train part5, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-400-.json
dev, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-dev-v1.1-bahasa.json
v2.0#
train part1, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-0-100.json
train part2, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-100-200.json
train part3, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-200-300.json
train part4, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-300-400.json
train part5, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-400-.json
dev, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-dev-v2.0-bahasa.json
Supervised#
We will share human-supervised answers in supervised.
how-to#
We use the Malaya translation module to translate English to Malay (EN -> MS).
Download the alignment dataset from Malay-Dataset/alignment.
Run the notebooks.
IndoNLI#
Original dataset, https://huggingface.co/datasets/indonli, translated using Malaya.