summarization
=============

ChatGPT Bahasa News Summarization
---------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/chatgpt-ms-news-summarization/resolve/main/summarization.json

CNN News
--------

Original paper, https://arxiv.org/pdf/1704.04368.pdf

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/translated-cnn-dailymail/resolve/main/test-translated-cnn-daily.jsonl
2. https://huggingface.co/datasets/mesolitica/translated-cnn-dailymail/resolve/main/train-translated-cnn-daily.jsonl
3. https://huggingface.co/datasets/mesolitica/translated-cnn-dailymail/resolve/main/val-translated-cnn-daily.jsonl

Citation
~~~~~~~~

.. code:: bibtex

   @article{DBLP:journals/corr/SeeLM17,
     author = {Abigail See and Peter J. Liu and Christopher D. Manning},
     title = {Get To The Point: Summarization with Pointer-Generator Networks},
     journal = {CoRR},
     volume = {abs/1704.04368},
     year = {2017},
     url = {http://arxiv.org/abs/1704.04368},
     archivePrefix = {arXiv},
     eprint = {1704.04368},
     timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
     biburl = {https://dblp.org/rec/journals/corr/SeeLM17.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
   }

CNN News
--------

Original paper, https://arxiv.org/pdf/1704.04368.pdf

download
~~~~~~~~

1. translated-cnn.json, https://f000.backblazeb2.com/file/malay-dataset/summarization/cnn-news/translated-cnn.json
2. train set, https://f000.backblazeb2.com/file/malay-dataset/summarization/cnn-news/translated-cnn-train.json
3. test set, https://f000.backblazeb2.com/file/malay-dataset/summarization/cnn-news/translated-cnn-test.json

Citation
~~~~~~~~

.. code:: bibtex

   @article{DBLP:journals/corr/SeeLM17,
     author = {Abigail See and Peter J. Liu and Christopher D. Manning},
     title = {Get To The Point: Summarization with Pointer-Generator Networks},
     journal = {CoRR},
     volume = {abs/1704.04368},
     year = {2017},
     url = {http://arxiv.org/abs/1704.04368},
     archivePrefix = {arXiv},
     eprint = {1704.04368},
     timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
     biburl = {https://dblp.org/rec/journals/corr/SeeLM17.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
   }

Download
~~~~~~~~

1. part1, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/translated-0-5000.json
2. part2, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/cnn-news-translated-5000-10000.json
3. part3, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/translated-10000-20000.json
4. part4, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/translated-20000-30000.json
5. part5, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/cnn-news-translated-30000-40000.json
6. part6, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/cnn-news-translated-50000-60000.json
7. part7, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/cnn-news-translated-60000-70000.json
8. part8, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/cnn-news-translated-70000-80000.json
9. part9, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/cnn-news-translated-80000-90000.json
10. part10, https://f000.backblazeb2.com/file/malay-dataset/summary/cnn/cnn-news-translated-90000-100000.json

Dailymail
---------

Original paper, https://arxiv.org/pdf/1704.04368.pdf

download
~~~~~~~~

1. translated-dailymail.json, https://f000.backblazeb2.com/file/malay-dataset/summarization/dailymail/translated-dailymail.json
2. train set, https://f000.backblazeb2.com/file/malay-dataset/summarization/dailymail/translated-dailymail-train.json
3. test set, https://f000.backblazeb2.com/file/malay-dataset/summarization/dailymail/translated-dailymail-test.json

Citation
~~~~~~~~

.. code:: bibtex

   @article{DBLP:journals/corr/SeeLM17,
     author = {Abigail See and Peter J. Liu and Christopher D. Manning},
     title = {Get To The Point: Summarization with Pointer-Generator Networks},
     journal = {CoRR},
     volume = {abs/1704.04368},
     year = {2017},
     url = {http://arxiv.org/abs/1704.04368},
     archivePrefix = {arXiv},
     eprint = {1704.04368},
     timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
     biburl = {https://dblp.org/rec/journals/corr/SeeLM17.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
   }

Gigawords
---------

download
~~~~~~~~

1. part1, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-0.json
2. part2, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-100000.json
3. part3, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-200000.json
4. part4, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-300000.json
5. part5, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-400000.json
6. part6, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-500000.json
7. part7, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-600000.json
8. part8, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-700000.json
9. part9, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-800000.json
10. part10, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-900000.json
11. part11, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1000000.json
12. part12, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1100000.json
13. part13, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1200000.json
14. part14, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1300000.json
15. part15, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1400000.json
16. part16, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1500000.json
17. part17, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1600000.json
18. part18, https://f000.backblazeb2.com/file/malay-dataset/summary/gigawords/translated-1700000.json

Citation
~~~~~~~~

.. code:: bibtex

   @article{graff2003english,
     title={English gigaword},
     author={Graff, David and Kong, Junbo and Chen, Ke and Maeda, Kazuaki},
     journal={Linguistic Data Consortium, Philadelphia},
     volume={4},
     number={1},
     pages={34},
     year={2003}
   }

   @article{Rush_2015,
     title={A Neural Attention Model for Abstractive Sentence Summarization},
     url={http://dx.doi.org/10.18653/v1/D15-1044},
     DOI={10.18653/v1/d15-1044},
     journal={Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
     publisher={Association for Computational Linguistics},
     author={Rush, Alexander M. and Chopra, Sumit and Weston, Jason},
     year={2015}
   }

Multinews
---------

download
~~~~~~~~

prefix, https://f000.backblazeb2.com/file/malay-dataset/

1. summary/multinews/translated-multinews-0.json
2. summary/multinews/translated-multinews-10500.json
3. summary/multinews/translated-multinews-11200.json
4. summary/multinews/translated-multinews-11900.json
5. summary/multinews/translated-multinews-12600.json
6. summary/multinews/translated-multinews-13300.json
7. summary/multinews/translated-multinews-1400.json
8. summary/multinews/translated-multinews-14000.json
9. summary/multinews/translated-multinews-14700.json
10. summary/multinews/translated-multinews-15400.json
11. summary/multinews/translated-multinews-16100.json
12. summary/multinews/translated-multinews-16800.json
13. summary/multinews/translated-multinews-17500.json
14. summary/multinews/translated-multinews-18200.json
15. summary/multinews/translated-multinews-18900.json
16. summary/multinews/translated-multinews-19600.json
17. summary/multinews/translated-multinews-20300.json
18. summary/multinews/translated-multinews-2100.json
19. summary/multinews/translated-multinews-21000.json
20. summary/multinews/translated-multinews-21700.json
21. summary/multinews/translated-multinews-22400.json
22. summary/multinews/translated-multinews-23100.json
23. summary/multinews/translated-multinews-23800.json
24. summary/multinews/translated-multinews-24500.json
25. summary/multinews/translated-multinews-25200.json
26. summary/multinews/translated-multinews-25900.json
27. summary/multinews/translated-multinews-26600.json
28. summary/multinews/translated-multinews-27300.json
29. summary/multinews/translated-multinews-2800.json
30. summary/multinews/translated-multinews-28000.json
31. summary/multinews/translated-multinews-28700.json
32. summary/multinews/translated-multinews-29400.json
33. summary/multinews/translated-multinews-30100.json
34. summary/multinews/translated-multinews-30800.json
35. summary/multinews/translated-multinews-31500.json
36. summary/multinews/translated-multinews-32200.json
37. summary/multinews/translated-multinews-32900.json
38. summary/multinews/translated-multinews-33600.json
39. summary/multinews/translated-multinews-34300.json
40. summary/multinews/translated-multinews-3500.json
41. summary/multinews/translated-multinews-35000.json
42. summary/multinews/translated-multinews-35700.json
43. summary/multinews/translated-multinews-36400.json
44. summary/multinews/translated-multinews-37100.json
45. summary/multinews/translated-multinews-37800.json
46. summary/multinews/translated-multinews-38500.json
47. summary/multinews/translated-multinews-39200.json
48. summary/multinews/translated-multinews-39900.json
49. summary/multinews/translated-multinews-40600.json
50. summary/multinews/translated-multinews-41300.json
51. summary/multinews/translated-multinews-4200.json
52. summary/multinews/translated-multinews-42000.json
53. summary/multinews/translated-multinews-42700.json
54. summary/multinews/translated-multinews-43400.json
55. summary/multinews/translated-multinews-44100.json
56. summary/multinews/translated-multinews-44800.json
57. summary/multinews/translated-multinews-45500.json
58. summary/multinews/translated-multinews-46200.json
59. summary/multinews/translated-multinews-46900.json
60. summary/multinews/translated-multinews-47600.json
61. summary/multinews/translated-multinews-48300.json
62. summary/multinews/translated-multinews-4900.json
63. summary/multinews/translated-multinews-49000.json
64. summary/multinews/translated-multinews-49700.json
65. summary/multinews/translated-multinews-50400.json
66. summary/multinews/translated-multinews-51100.json
67. summary/multinews/translated-multinews-51800.json
68. summary/multinews/translated-multinews-52500.json
69. summary/multinews/translated-multinews-53200.json
70. summary/multinews/translated-multinews-53900.json
71. summary/multinews/translated-multinews-54600.json
72. summary/multinews/translated-multinews-55300.json
73. summary/multinews/translated-multinews-5600.json
74. summary/multinews/translated-multinews-56000.json
75. summary/multinews/translated-multinews-6300.json
76. summary/multinews/translated-multinews-700.json
77. summary/multinews/translated-multinews-7000.json
78. summary/multinews/translated-multinews-7700.json
79. summary/multinews/translated-multinews-8400.json
80. summary/multinews/translated-multinews-9100.json
81. summary/multinews/translated-multinews-9800.json

Citation
~~~~~~~~

.. code:: bibtex

   @article{DBLP:journals/corr/abs-1906-01749,
     author = {Alexander R. Fabbri and Irene Li and Tianwei She and Suyi Li and Dragomir R. Radev},
     title = {Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model},
     journal = {CoRR},
     volume = {abs/1906.01749},
     year = {2019},
     url = {http://arxiv.org/abs/1906.01749},
     archivePrefix = {arXiv},
     eprint = {1906.01749},
     timestamp = {Thu, 13 Jun 2019 13:36:00 +0200},
     biburl = {https://dblp.org/rec/journals/corr/abs-1906-01749.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
   }

Semisupervised Bahasa AstroAwani News Summarization
---------------------------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-bisnes.json.nested.semisupervised
2. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-dunia.json.nested.semisupervised
3. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-hiburan.json.nested.semisupervised
4. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-malaysia.json.nested.semisupervised
5. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-politik.json.nested.semisupervised
6. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-sukan.json.nested.semisupervised
7. https://huggingface.co/datasets/mesolitica/crawl-astroawani/raw/main/berita-teknologi.json.nested.semisupervised
8. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/gaya-hidup.json.nested.semisupervised

Semisupervised Bahasa News Summarization
----------------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/semisupervised-abstractive-summarization-ms-news/resolve/main/populate-news.json.semisupervised

how-to
~~~~~~

Sentiment labels for cnn-news, multinews, and semisupervised.

cnn
"""

1. part 1, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/cnn-summarization-0.tsv.sentiment
2. part 2, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/cnn-summarization-1.tsv.sentiment

multinews
"""""""""

1. part 1, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/multinews-summarization-0.tsv.sentiment
2. part 2, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/multinews-summarization-1.tsv.sentiment

semisupervised
""""""""""""""

1. part 1, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/summary.tsv.sentiment

news title
""""""""""

1. part 1, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/news-title-0.tsv.sentiment
2. part 2, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/news-title-1.tsv.sentiment
3. part 3, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/news-title-2.tsv.sentiment
4. part 4, https://f000.backblazeb2.com/file/malay-dataset/news/summary/sentiment/news-title-3.tsv.sentiment

Xwikis
------

Original paper, https://arxiv.org/abs/2202.09583

Huggingface page, https://huggingface.co/datasets/GEM/xwikis

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en-test.jsonl.translated
2. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en-valid.jsonl.translated
3. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en00.splitted.translated
4. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en01.splitted.translated
5. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en02.splitted.translated
6. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en03.splitted.translated
7. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en04.splitted.translated
8. https://huggingface.co/datasets/mesolitica/translated-xwikis/resolve/main/filtered-en05.splitted.translated

Citation
~~~~~~~~

.. code:: bibtex

   @article{DBLP:journals/corr/SeeLM17,
     author = {Abigail See and Peter J. Liu and Christopher D. Manning},
     title = {Get To The Point: Summarization with Pointer-Generator Networks},
     journal = {CoRR},
     volume = {abs/1704.04368},
     year = {2017},
     url = {http://arxiv.org/abs/1704.04368},
     archivePrefix = {arXiv},
     eprint = {1704.04368},
     timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
     biburl = {https://dblp.org/rec/journals/corr/SeeLM17.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
   }
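
Fetching the files
------------------

The download lists on this page mix absolute URLs with, in the Multinews section, paths relative to a prefix. The following is a minimal sketch of one way to turn either kind of entry into a URL, fetch it, and parse the result, using only the Python standard library. The helper names (``resolve_url``, ``download``, ``parse_records``) and the ``data`` folder are hypothetical. The page does not document the record layout of any file, so the parser assumes only that names containing ``.jsonl`` hold one JSON value per line and that other JSON files hold a single document; files with other suffixes (for example ``.splitted.translated`` or ``.tsv.sentiment``) may need different handling.

```python
import json
import urllib.request
from pathlib import Path

# Prefix quoted in the Multinews section of this page.
MULTINEWS_PREFIX = "https://f000.backblazeb2.com/file/malay-dataset/"


def resolve_url(entry, prefix=MULTINEWS_PREFIX):
    """Return entry unchanged if it is already a URL, else join it to prefix."""
    if entry.startswith(("http://", "https://")):
        return entry
    return prefix.rstrip("/") + "/" + entry.lstrip("/")


def download(url, dest_dir="data"):
    """Fetch url into dest_dir (a hypothetical folder) and return the local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / url.rsplit("/", 1)[-1]
    if not target.exists():  # skip files that were fetched earlier
        urllib.request.urlretrieve(url, str(target))
    return target


def parse_records(text, filename):
    """Parse file contents.

    Assumption: names containing '.jsonl' are JSON Lines (one value per
    line); anything else is treated as a single JSON document.
    """
    if ".jsonl" in filename:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)


if __name__ == "__main__":
    # Builds the URL only; call download(url) to actually fetch the file.
    url = resolve_url("summary/multinews/translated-multinews-0.json")
    print(url)
```

To load a file after fetching it, read the returned path and pass the text and file name to the parser, for example ``parse_records(path.read_text(encoding="utf-8"), path.name)``. Inspect the first record before relying on any field names, since none are stated here.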