dumping#

Common Crawl#

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Common Crawl},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/singlish-text}}
}

Facebook#

Crawled from Facebook using https://github.com/kevinzg/facebook-scraper
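
A minimal sketch of how crawled posts could be reduced to plain text lines. It assumes that facebook-scraper's `get_posts` yields dictionaries with a `text` key, as shown in that project's README. `collect_texts` and the page name `some-page` are hypothetical; this is not the script that produced this dump.

```python
from typing import Dict, Iterable, List


def collect_texts(posts: Iterable[Dict]) -> List[str]:
    """Return the non-empty, whitespace-normalised text of each post."""
    texts = []
    for post in posts:
        text = post.get("text") or ""   # missing or None text becomes ""
        text = " ".join(text.split())   # collapse newlines and repeated spaces
        if text:
            texts.append(text)
    return texts


# Usage sketch (needs `pip install facebook-scraper` and network access;
# "some-page" is a placeholder, not a page used for this dataset):
#
#   from facebook_scraper import get_posts
#   texts = collect_texts(get_posts("some-page", pages=5))
```

Keeping the network call outside the helper lets the cleaning step run on any iterable of post dictionaries, including previously saved dumps.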

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Malaysian Facebook pages},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/facebook}}
}

IMDA transcription#

Extracted from the IMDA dataset, https://www.imda.gov.sg/

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Singlish Texts},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/singlish-text}}
}

Instagram#

Gathered using crawlers.

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Bahasa Instagram},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/instagram}}
}

Karangan Sekolah#

Gathered from Google Search.

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Karangan Sekolah},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/karangan-sekolah}}
}

Manglish Twitter#

Gathered from Twitter Streaming.

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Manglish Twitter},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/manglish}}
}

NLLB#

how-to#

Total size: 57.7 MB, 399251 sentences, download link.

Singapore News#

Contributed by brytjy.

download#

Total size: 213.1 MB, 1760382 sentences, https://f000.backblazeb2.com/file/malay-dataset/dumping/singlish/sg-news.txt

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Singapore News},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/singapore-news}}
}

Singlish Text#

Singlish is Singaporean slang: mostly English, mixed with Chinese, Bahasa and Tamil.

Randomly crawled from various Singaporean websites and blogs.

Total size: 1.2 GB, 19870766 sentences.

Contributed by brytjy.

Citation#

@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Singlish Texts},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/dumping/singlish-text}}
}

Twitter Bahasa#

Contact me personally to get the full data.

Download#

  1. dump dated 2019-07-06,

577.5 MB, 10172726 sentences, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/dumping-twitter-6-july-2019.json

  1. 2020-02-22,

english, 136MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-02-22-twitter-dump-en.json

bahasa, 332MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-02-22-twitter-dump-in.json

  1. 2020-02-22 - 2020-03-08,

english, 138MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-03-08-twitter-dump-en.json

bahasa, 357MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-03-08-twitter-dump-in.json

  1. 2020-03-08 - 2020-03-28,

english, 96MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-03-28-twitter-dump-en.json

bahasa, 261MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-03-28-twitter-dump-in.json

  1. 2020-03-28 - 2020-04-12

english, 108.1MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-04-12-twitter-dump-en.json

bahasa, 323.3MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-04-12-twitter-dump-in.json

  1. 2020-04-12 - 2020-04-22

english, 72.5MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-04-22-twitter-dump-en.json

bahasa, 261MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-04-22-twitter-dump-in.json

  1. 2020-04-22 - 2020-05-02

english, 73.6MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-05-02-twitter-dump-en.json

bahasa, 219.2MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-05-02-twitter-dump-in.json

  1. 2020-05-02 - 2020-05-11

english, 67.9MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-05-11-twitter-dump-en.json

bahasa, 213.4MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-05-11-twitter-dump-in.json

  1. 2020-05-11 - 2020-05-31

english, 142.2MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-05-31-twitter-dump-en.json

bahasa, 386.6MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/2020-05-31-twitter-dump-in.json

  1. 2021-03-06 - 2021-04-21

bahasa, 533MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/compiled-2021-03-06-twitter.tar

  1. 2021-04-21 - 2021-06-06

bahasa, 778MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/compiled-2021-04-21-twitter.tar

  1. 2021-06-06 - 2021-07-23

bahasa, 1.3GB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/compiled-2021-06-06-twitter.tar

  1. 2021-07-23 - 2022-06-08

bahasa, 593MB, https://f000.backblazeb2.com/file/malay-dataset/dumping/twitter/compiled-2022-06-08-twitter.tar

  1. last snapshot, https://huggingface.co/mesolitica/snapshot-twitter-2022-09-03

  • minimum timestamp, 2022-04-17T16:30:07.000Z

  • maximum timestamp, 2022-09-03T09:23:52.000Z

  • 7075025 rows

  • full attributes

{
"datetime": "2022-04-18T05:57:04",
"datetime_gmt8": "2022-04-18T13:57:04",
"data_text": "kekal halal kak https://t.co/YHKqszqPnS",
"body": "kekal halal kak https://t.co/YHKqszqPnS",
"screen_name": "Luke_Sebastian2",
"followers_count": 10413,
"friends_count": 72,
"listed_count": 6,
"favourites_count": 1494,
"statuses_count": 948,
"quoted_status_text": "NULL",
"lang": "in",
"retweet": "false",
"retweet_text": "NULL",
"retweet_text_full": "NULL",
"retweet_count": 0,
"retweet_detail": {},
"quote_count": 0,
"favorite_count": 0,
"reply_count": 0,
"id_str": "1515932406368202753",
"tweet": {
"created_at": "Mon Apr 18 05:57:04 +0000 2022",
"id": 1515932406368202800,
"id_str": "1515932406368202753",
"text": "kekal halal kak๐Ÿ˜๐Ÿคซ https://t.co/YHKqszqPnS",
"display_text_range": [
0,
17
],
"source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
"truncated": false,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
"in_reply_to_user_id_str": null,
"in_reply_to_screen_name": null,
"user": {
"id": 1431086333024374800,
"id_str": "1431086333024374792",
"name": "☄ʟᴜᴋᴇ",
"screen_name": "Luke_Sebastian2",
"location": "Malaysia",
"url": "http://t.me/Luke_Alqamara",
"description": "|𝟮𝟬𝟰|⚤|📚𝗧𝗼𝗽|🇮🇩|📌🇲🇾|Law Student💼|•𝐤𝐞𝐤𝐚𝐬𝐢𝐡𝐤𝐮:@Trevor_Louise1•|Dm me for endorsement/Collab and rates also📩!|•don't forget to smile😊•",
"translator_type": "none",
"protected": false,
"verified": false,
"followers_count": 10413,
"friends_count": 72,
"listed_count": 6,
"favourites_count": 1494,
"statuses_count": 948,
"created_at": "Fri Aug 27 02:49:28 +0000 2021",
"utc_offset": null,
"time_zone": null,
"geo_enabled": true,
"lang": null,
"contributors_enabled": false,
"is_translator": false,
"profile_background_color": "F5F8FA",
"profile_background_image_url": "",
"profile_background_image_url_https": "",
"profile_background_tile": false,
"profile_link_color": "1DA1F2",
"profile_sidebar_border_color": "C0DEED",
"profile_sidebar_fill_color": "DDEEF6",
"profile_text_color": "333333",
"profile_use_background_image": true,
"profile_image_url": "http://pbs.twimg.com/profile_images/1500850780823494658/snCdyeen_normal.jpg",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/1500850780823494658/snCdyeen_normal.jpg",
"profile_banner_url": "https://pbs.twimg.com/profile_banners/1431086333024374792/1647061513",
"default_profile": true,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null,
"withheld_in_countries": []
},
"geo": null,
"coordinates": null,
"place": {
"id": "7b02fbddf4d9f2c6",
"url": "https://api.twitter.com/1.1/geo/id/7b02fbddf4d9f2c6.json",
"place_type": "city",
"name": "Kuala Lumpur City",
"full_name": "Kuala Lumpur City, Kuala Lumpur Federal Territory",
"country_code": "MY",
"country": "Malaysia",
"bounding_box": {
"type": "Polygon",
"coordinates": [
[
[
101.668232,
3.104906
],
[
101.668232,
3.192155
],
[
101.742378,
3.192155
],
[
101.742378,
3.104906
]
]
]
},
"attributes": {}
},
"contributors": null,
"is_quote_status": false,
"quote_count": 0,
"reply_count": 0,
"retweet_count": 0,
"favorite_count": 0,
"entities": {
"hashtags": [],
"urls": [],
"user_mentions": [],
"symbols": [],
"media": [
{
"id": 1515932334612107300,
"id_str": "1515932334612107268",
"indices": [
18,
41
],
"additional_media_info": {
"monetizable": false
},
"media_url": "http://pbs.twimg.com/ext_tw_video_thumb/1515932334612107268/pu/img/ak2K23DgNDDV-UCC.jpg",
"media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1515932334612107268/pu/img/ak2K23DgNDDV-UCC.jpg",
"url": "https://t.co/YHKqszqPnS",
"display_url": "pic.twitter.com/YHKqszqPnS",
"expanded_url": "https://twitter.com/Luke_Sebastian2/status/1515932406368202753/video/1",
"type": "photo",
"sizes": {
"thumb": {
"w": 150,
"h": 150,
"resize": "crop"
},
"medium": {
"w": 540,
"h": 960,
"resize": "fit"
},
"small": {
"w": 383,
"h": 680,
"resize": "fit"
},
"large": {
"w": 540,
"h": 960,
"resize": "fit"
}
}
}
]
},
"extended_entities": {
"media": [
{
"id": 1515932334612107300,
"id_str": "1515932334612107268",
"indices": [
18,
41
],
"additional_media_info": {
"monetizable": false
},
"media_url": "http://pbs.twimg.com/ext_tw_video_thumb/1515932334612107268/pu/img/ak2K23DgNDDV-UCC.jpg",
"media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/1515932334612107268/pu/img/ak2K23DgNDDV-UCC.jpg",
"url": "https://t.co/YHKqszqPnS",
"display_url": "pic.twitter.com/YHKqszqPnS",
"expanded_url": "https://twitter.com/Luke_Sebastian2/status/1515932406368202753/video/1",
"type": "video",
"video_info": {
"aspect_ratio": [
9,
16
],
"duration_millis": 15232,
"variants": [
{
"bitrate": 632000,
"content_type": "video/mp4",
"url": "https://video.twimg.com/ext_tw_video/1515932334612107268/pu/vid/320x568/3gN3Udy0BrbU8HFr.mp4?tag=12"
},
{
"content_type": "application/x-mpegURL",
"url": "https://video.twimg.com/ext_tw_video/1515932334612107268/pu/pl/V6UZr3a49tZHwoia.m3u8?tag=12&container=fmp4"
},
{
"bitrate": 950000,
"content_type": "video/mp4",
"url": "https://video.twimg.com/ext_tw_video/1515932334612107268/pu/vid/480x852/CpA6Jht3IZjzh75X.mp4?tag=12"
},
{
"bitrate": 2176000,
"content_type": "video/mp4",
"url": "https://video.twimg.com/ext_tw_video/1515932334612107268/pu/vid/540x960/EdWN9mo8jIbA5PDM.mp4?tag=12"
}
]
},
"sizes": {
"thumb": {
"w": 150,
"h": 150,
"resize": "crop"
},
"medium": {
"w": 540,
"h": 960,
"resize": "fit"
},
"small": {
"w": 383,
"h": 680,
"resize": "fit"
},
"large": {
"w": 540,
"h": 960,
"resize": "fit"
}
}
}
]
},
"favorited": false,
"retweeted": false,
"possibly_sensitive": false,
"filter_level": "low",
"lang": "in",
"timestamp_ms": "1650261424997",
"ignore_lang": true
},
"type": "search"
}
  • stream filtered by geo boundary,

stream.filter(
locations=[
99.8568959909,
0.8232449017,
119.5213933664,
7.2037547089,
]
)
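
The `locations` filter takes longitude/latitude pairs, south-west corner first and north-east corner second. The sketch below reads the four numbers above that way and checks whether a point or a `place.bounding_box` polygon (as in the sample record) falls inside. `STREAM_BBOX`, `in_bbox` and `place_centre` are hypothetical names introduced for illustration only.

```python
from typing import Sequence, Tuple

# The four values passed to stream.filter, read as
# (west_lon, south_lat, east_lon, north_lat).
STREAM_BBOX = (99.8568959909, 0.8232449017, 119.5213933664, 7.2037547089)


def in_bbox(lon: float, lat: float, bbox: Sequence[float] = STREAM_BBOX) -> bool:
    """True if the point lies inside the box, edges included."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def place_centre(bounding_box: dict) -> Tuple[float, float]:
    """Average the corner points of a `place.bounding_box` polygon."""
    ring = bounding_box["coordinates"][0]   # first (outer) ring of the polygon
    lons = [point[0] for point in ring]     # each point is [lon, lat]
    lats = [point[1] for point in ring]
    return sum(lons) / len(lons), sum(lats) / len(lats)


# Corner points of the "Kuala Lumpur City" place in the sample record.
sample_box = {
    "type": "Polygon",
    "coordinates": [[
        [101.668232, 3.104906],
        [101.668232, 3.192155],
        [101.742378, 3.192155],
        [101.742378, 3.104906],
    ]],
}
lon, lat = place_centre(sample_box)
print(in_bbox(lon, lat))
```

A check like this can be used to re-verify records offline, for example when mixing the geo-filtered stream with records whose `type` is `search`.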
  1. dedup, https://huggingface.co/datasets/mesolitica/twitter-dedup/resolve/main/dedup-twitter.jsonl
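
How the dedup file was produced is not stated here. One plausible approach, sketched below under that assumption, keeps the first record seen for each `id_str` in a JSON Lines file. `dedup_records` is a hypothetical helper, and the field names are taken from the sample record above.

```python
import json
from typing import Iterable, Iterator


def dedup_records(lines: Iterable[str], key: str = "id_str") -> Iterator[dict]:
    """Yield each JSON record once, keeping the first record per `key` value."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        record = json.loads(line)
        value = record.get(key)
        if value is not None:
            if value in seen:
                continue             # duplicate of an earlier record
            seen.add(value)
        yield record                 # records without the key are kept as-is


sample_lines = [
    '{"id_str": "1515932406368202753", "data_text": "kekal halal kak"}',
    '',
    '{"id_str": "1515932406368202753", "data_text": "kekal halal kak"}',
    '{"id_str": "2", "data_text": "selamat pagi"}',
]
unique = list(dedup_records(sample_lines))
print(len(unique))
```

The sketch keys on `id_str` rather than the numeric `id` because, in the sample record, `id` (1515932406368202800) no longer matches `id_str` (1515932406368202753) in its last digits. Passing `key="data_text"` would deduplicate on text instead.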