question-answer
===============

Common Crawl QA
---------------

Generated using ChatGPT 3.5.

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-open-qa/resolve/main/common-crawl-qa.jsonl

Extractive News QA
------------------

Generated using ChatGPT 3.5.

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-extractive-qa/resolve/main/news-extractive.jsonl

Hansard QA
----------

Generated using ChatGPT 3.5.

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-open-qa/resolve/main/hansard-qa.jsonl

Synthetic QA Choice
-------------------

Generated using ChatGPT3.5 from the following sources,

1. https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset/resolve/main/dewanbahasa-jdbp.jsonl
2. https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset/resolve/main/majalahsains.jsonl
3. https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset/resolve/main/wikipedia-2023-10-01.jsonl

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-qa-choice/resolve/main/qa-dewanbahasa-jdbp.jsonl
2. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-qa-choice/resolve/main/qa-majalahsains.jsonl
3. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-qa-choice/resolve/main/qa-ms-wikipedia.jsonl

Synthetic Malaysian QA
----------------------

Generated common QA using ChatGPT3 for,

1. Agrobank
2. Bank Negara Malaysia
3. Bank Perusahaan Kecil dan Sederhana Malaysia
4. Bank Rakyat
5. Bank Simpanan Nasional
6. Bursa Malaysia
7. Dewan Bahasa dan Pustaka
8. Institut Kesihatan Umum
9. Institut Penyelidikan Perubatan
10. Institut Penyelidikan Sains dan Teknologi Pertahanan
11. Institut Penyelidikan Tingkahlaku Kesihatan
12. Institut Penyelidikan dan Kemajuan Pertanian Malaysia
13. Jabatan Akauntan Negara
14. Jabatan Bomba dan Penyelamat Malaysia
15. Jabatan Hal Ehwal Kesatuan Sekerja
16. Jabatan Hal Ehwal Veteran
17. Jabatan Imigresen Malaysia
18. Jabatan Kastam Diraja Malaysia
19. Jabatan Kebajikan Masyarakat
20. Jabatan Kemajuan Orang Asli
21. Jabatan Kerajaan Tempatan
22. Jabatan Kerja Raya
23. Jabatan Keselamatan Jalan Raya
24. Jabatan Keselamatan dan Keselamatan Pekerjaan
25. Jabatan Ketua Hakim Peguam
26. Jabatan Landskap Negara
27. Jabatan Latihan Khidmat Negara
28. Jabatan Laut Malaysia
29. Jabatan Pembangunan Wanita
30. Jabatan Pendaftaran Pertubuhan Malaysia
31. Jabatan Penerangan Malaysia
32. Jabatan Pengangkutan Jalan
33. Jabatan Pengurusan Sisa Pepejal Negara
34. Jabatan Penilaian dan Perkhidmatan Negara
35. Jabatan Penjara Malaysia
36. Jabatan Perancangan Bandar dan Desa
37. Jabatan Perdana Menteri Malaysia
38. Jabatan Perhubungan Perusahaan
39. Jabatan Perikanan Malaysia
40. Jabatan Perkhidmatan Kuarantin dan Pemeriksaan Malaysia
41. Jabatan Perkhidmatan Veterinar
42. Jabatan Pertanian Malaysia
43. Jabatan Perumahan Negara
44. Jabatan Perumahan dan Pengurusan Strata
45. Jabatan Sukarelawan Malaysia
46. Jabatan Tenaga Kerja
47. Jabatan Tenaga Kerja Manusia
48. Khazanah Nasional
49. Kolej Pertanian
50. Kumpulan Wang Persaraan
51. Kumpulan Wang Simpanan Pekerja
52. Lembaga Hasil Dalam Negeri Malaysia
53. Lembaga Kemajuan Ikan Malaysia
54. Lembaga Kemajuan Pertanian Kemubu
55. Lembaga Kemajuan Pertanian Muda
56. Lembaga Pelabuhan Bintulu
57. Lembaga Pelabuhan Johor
58. Lembaga Pelabuhan Klang
59. Lembaga Pelabuhan Kuantan
60. Lembaga Pemasaran Pertanian Persekutuan
61. Lembaga Pembangunan Pelaburan Malaysia
62. Lembaga Pembiayaan Perumahan Sektor Awam
63. Lembaga Penapisan Filem
64. Lembaga Penduduk dan Pembangunan Keluarga Negara
65. Lembaga Peperiksaan Malaysia
66. Lembaga Perindustrian Nanas Malaysia
67. Lembaga Perkhidmatan Kewangan Labuan
68. Lembaga Pertubuhan Peladang
69. Lembaga Promosi Kesihatan Malaysia
70. Lembaga Totalisator Malaysia
71. Pusat Pergigian Kanak-Kanak & Kolej Latihan Pergigian Malaysia

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-general-qa/resolve/main/malaysian-general-qa-gov-my.jsonl

Wikipedia QA
------------

Generated using ChatGPT 3.5.

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt-malaysian-open-qa/resolve/main/wikipedia-qa.jsonl

Synthetic CommonSense
---------------------

Generated using ChatGPT4, originally from https://huggingface.co/datasets/commonsense_qa

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt4-commonsense-qa/resolve/main/synthetic-commonsense.jsonl

Synthetic Kertas 1
------------------

Generated using ChatGPT4.

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1/resolve/main/synthetic-exam.jsonl
2. https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1/resolve/main/synthetic-tatabahasa.jsonl
3. https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1/resolve/main/synthetic-tatabahasabm.tripod.com-bm-kertas1.jsonl
4. https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1/resolve/main/synthetic-tatabahasa-v2.jsonl
5. https://huggingface.co/datasets/mesolitica/chatgpt4-synthetic-kertas1/resolve/main/synthetic-latihanbm.jsonl

Synthetic Malaysian QA
----------------------

Generated common QA using ChatGPT4 for,

1. politics
2. socioeconomy
3. culture
4. gender
5. religion
6. sociology
7. social class
8. technology
9. ethnicity
10. infrastructure
11. health
12. education
13. ecology
14. party politics
15. diplomacy
16. history
17. cuisine
18. microeconomics
19. business
20. artificial intelligence
21. law
22. negeri johor
23. negeri kedah
24. negeri kelantan
25. negeri melaka
26. negeri negeri sembilan
27. negeri pahang
28. negeri perak
29. negeri perlis
30. negeri pulau pinang
31. negeri selangor
32. negeri terengganu
33. negeri sabah
34. negeri sarawak
35. kuala lumpur
36. negeri labuan
37. putrajaya
38. najib razak
39. anwar ibrahim
40. parti keadilan rakyat
41. parti islam semalaysia
42. dr mahathir mohamad
43. barisan nasional
44. constitutional monarchy
45. parliamentary democracy
46. political economy
47. political dynamic
48. empowerment of youths
49. kebebasan bersuara
50. sastera
51. tatabahasa
52. kesusasteraan melayu
53. pantun
54. sajak
55. syair
56. hadis
57. hukum aqidah islam
58. hukum fiqah islam

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt4-malaysian-general-qa/resolve/main/malaysian-general-qa.jsonl
2. https://huggingface.co/datasets/mesolitica/chatgpt4-malaysian-general-qa/resolve/main/malaysian-general-qa-v2.jsonl
3. https://huggingface.co/datasets/mesolitica/chatgpt4-malaysian-general-qa/resolve/main/malaysian-general-qa-v3.jsonl
4. https://huggingface.co/datasets/mesolitica/chatgpt4-malaysian-general-qa/resolve/main/malaysian-general-qa-v4.jsonl
5. https://huggingface.co/datasets/mesolitica/chatgpt4-malaysian-general-qa/resolve/main/malaysian-general-qa-v5.jsonl
6. https://huggingface.co/datasets/mesolitica/chatgpt4-malaysian-general-qa/resolve/main/tatabahasa.jsonl
7. https://huggingface.co/datasets/mesolitica/chatgpt4-malaysian-general-qa/resolve/main/loghat.jsonl

Notes to myself
~~~~~~~~~~~~~~~

1. Filter short questions.

Natural Questions
-----------------

Original paper, https://research.google/pubs/pub47761/

download
~~~~~~~~

Data structure is like this,

.. code:: text

   Question <> Answer

1. download train set here, https://f000.backblazeb2.com/file/malay-dataset/qa/natural/translated-train.json
2. download validation set here, https://f000.backblazeb2.com/file/malay-dataset/qa/natural/translated-validation.json

Citation
~~~~~~~~

.. code:: bibtex

   @article{47761,
     title = {Natural Questions: a Benchmark for Question Answering Research},
     author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
     year = {2019},
     journal = {Transactions of the Association of Computational Linguistics}
   }

SQUAD
-----

**Thanks to `The Translate-Align-Retrieve (TAR) method for synthetic QA corpora generation `__ for steps to translate SQUAD dataset**.

Original website, https://rajpurkar.github.io/SQuAD-explorer/

Original paper, https://arxiv.org/abs/1806.03822

Steps to reproduce the translation are in the `notebook `__.

download
~~~~~~~~

1. ms-train-1.1.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-train-1.1.json
2. ms-dev-1.1.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-dev-1.1.json
3. ms-train-2.0.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-train-2.0.json
4. ms-dev-2.0.json, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/ms-dev-2.0.json

Citation
~~~~~~~~

.. code:: bibtex

   @article{DBLP:journals/corr/abs-1806-03822,
     author = {Pranav Rajpurkar and Robin Jia and Percy Liang},
     title = {Know What You Don't Know: Unanswerable Questions for SQuAD},
     journal = {CoRR},
     volume = {abs/1806.03822},
     year = {2018},
     url = {http://arxiv.org/abs/1806.03822},
     archivePrefix = {arXiv},
     eprint = {1806.03822},
     timestamp = {Mon, 13 Aug 2018 16:48:21 +0200},
     biburl = {https://dblp.org/rec/journals/corr/abs-1806-03822.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
   }

how-to
~~~~~~

v1.1
^^^^

1. train part1, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-0-100.json
2. train part2, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-100-200.json
3. train part3, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-200-300.json
4. train part4, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-300-400.json
5. train part5, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v1.1-bahasa-400-.json
6. dev, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-dev-v1.1-bahasa.json

v2.0
^^^^

1. train part1, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-0-100.json
2. train part2, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-100-200.json
3. train part3, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-200-300.json
4. train part4, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-300-400.json
5. train part5, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-train-v2.0-bahasa-400-.json
6. dev, https://f000.backblazeb2.com/file/malay-dataset/qa/squad/translated-dev-v2.0-bahasa.json

Supervised
~~~~~~~~~~

We will share human-supervised answers in `supervised `__.

how-to
~~~~~~

**We use the Malaya translation module to translate EN -> MS**.

1. Download alignment dataset from `Malay-Dataset/alignment `__.
2. Run notebooks.

IndoNLI
-------

Originally from https://huggingface.co/datasets/indonli, translated using Malaya.

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/translated-indonli/resolve/main/train.jsonl
2. https://huggingface.co/datasets/mesolitica/translated-indonli/resolve/main/validation.jsonl
3. https://huggingface.co/datasets/mesolitica/translated-indonli/resolve/main/test_expert.jsonl
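
Many of the downloads on this page end in ``.jsonl``. Assuming they follow the usual JSON Lines convention (one JSON object per line), a minimal sketch for reading them could look like the following. The ``read_jsonl`` helper and the ``question`` / ``answer`` field names in the sample are hypothetical, used only for illustration; the real files may use different keys.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical in-memory sample; field names are assumptions, not the
# documented schema of any dataset listed above.
sample = [
    '{"question": "Apakah ibu negara Malaysia?", "answer": "Kuala Lumpur"}',
    "",
    '{"question": "2 + 2?", "answer": "4"}',
]
records = list(read_jsonl(sample))
print(len(records))
```

For a downloaded file, the same helper should work on an open file handle, for example ``read_jsonl(open("train.jsonl", encoding="utf-8"))``, since a text file iterates line by line.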