crawl#
1Media.My#
Crawled from https://www.1media.my/
Project issue: https://github.com/users/huseinzol05/projects/1/views/1?filterQuery=1m&pane=issue&itemId=38028189
Download#
Working Notebook#
amanz.my#
Scrap Angkasfera (798 kB)#
Link to Dataset Repository: https://huggingface.co/datasets/hazmannaim/angkasfera_text
article.poliklinikazzaara.com.my#
asklegal.my#
AstroAwani#
Copyright in the data remains with the original owners; do not use this data for commercial purposes.
download#
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-bisnes.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-dunia.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-hiburan.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-malaysia.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-politik.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-sukan.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/raw/main/berita-teknologi.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/gaya-hidup.json.nested
https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-english.json
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Crawling AstroAwani},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/astroawani}}
}
autobuzz.my#
azhafizah.com#
b.cari.com.my#
beautifulnara.com#
Bernama#
download#
https://huggingface.co/datasets/mesolitica/crawl-bernama/resolve/main/parse-bernama.json
or if you want pickled requests objects, https://huggingface.co/datasets/mesolitica/crawl-bernama/resolve/main/bernama.pkl
license: apache-2.0
language: en
TLDR#
website: bikesrepublic
num. of webpages scraped: 6,969
link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-bikesrepublic/resolve/main/bikesrepublic-scraped-data-fixed-6969-webpages.jsonl
date of scraping: 10th September 2023
pull request: https://github.com/huseinzol05/malaysian-dataset/pull/291
bjak.my#
blog.fincrew.my#
blog.limkitsiang.com#
blog.malaysia-asia.my#
blog.pandai.com#
blog.yeahhost.com.my#
blogmalaysia.com#
blogtipskerjaya.net#
Buku teks#
Originally from https://www.ipendidikan.my/buku-teks-digital-kssr-tahun-1-hingga-6.html and https://www.ipendidikan.my/koleksi-buku-teks-digital-asas-kssm.html
buletinmutiara.com#
bullishbursa.blogspot.com#
bumigemilang.com#
bumiinvest20.home.blog#
buro247.my#
c.cari.com.my#
Carigold#
download#
https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/politics.json
https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/current-issues.json
https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/santai-others.json
https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/everything.jsonl
https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/posts.parquet
carlist.my#
carsifu.my#
carsome.my#
cn.cari.com.my#
columbiaasia.com#
Crossref#
Search DOIs based on keywords.
data.gov.my#
denaihati.com#
dewanbahasa.jendeladbp.my#
discoverkl.com#
diva.my#
doctoroncall.com.my#
dotproperty.com.my#
dsf.my#
e-khutbah#
ecentral.my#
edu.my PDF#
Manually saved as HTML from Google search results using the query site:edu.my filetype:pdf.
ekonomirakyat.com#
enanyang.my#
eniraimathi.blogspot.com#
Eprints Malaysia Universities#
download#
https://huggingface.co/datasets/mesolitica/eprints-uitm, 187GB
https://huggingface.co/datasets/mesolitica/eprints-usim, 21GB
https://huggingface.co/datasets/mesolitica/eprints-ums, 36GB
https://huggingface.co/datasets/mesolitica/eprints-ukm, 13GB
https://huggingface.co/datasets/mesolitica/eprints-unimas, 68GB
https://huggingface.co/datasets/mesolitica/eprints-usm, 35GB
https://huggingface.co/datasets/mesolitica/eprints-uia, 22GB
https://huggingface.co/datasets/mesolitica/eprints-uum, 8.7GB
https://huggingface.co/datasets/mesolitica/eprints-others, 1.1GB
extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/eprints-dedup.jsonl
postfilter dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/filtered-eprints-dedup.jsonl
fintechnews.my#
FMT#
download#
for HTML#
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-0.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-1.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-2.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-3.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-4.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-5.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-6.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-7.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-8.jsonl
https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-9.jsonl
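The ten shards above share one naming pattern, so they can be enumerated rather than listed by hand. A minimal sketch, assuming each shard is newline-delimited JSON (suggested by the `.jsonl` extension); `shard_urls` and `iter_jsonl` are hypothetical helper names, and the record schema is not documented here.

```python
import json
from typing import Iterable, Iterator, List

FMT_BASE = "https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main"


def shard_urls(n_shards: int = 10) -> List[str]:
    """Build the URLs fmt-0.jsonl ... fmt-{n_shards - 1}.jsonl listed above."""
    return [f"{FMT_BASE}/fmt-{i}.jsonl" for i in range(n_shards)]


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON value per non-empty line (assumed JSONL layout)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

After downloading a shard, an open file handle can be passed straight to `iter_jsonl`, since iterating a text file yields its lines one at a time.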
Foodpanda#
download#
Available cities and page links.
{'Kuala Lumpur': '/city/kuala-lumpur',
'Penang': '/city/bayan-baru',
'Petaling Jaya': '/city/petaling-jaya',
'Subang': '/city/puchong',
'Shah Alam': '/city/shah-alam',
'Cyberjaya': '/city/cyberjaya',
Available restaurants for each city.
{'Kuala Lumpur': {'La Risata Bar Pizzeria Ristorante': {'star': '4.5',
'delivery': 'Free',
'characters': ['Meat', 'Pasta', 'Salad', 'Pizza', 'Italian'],
'link': '/chain/ce3iw/la-risata-bar-pizzeria-ristorante'},
'Viapre Italian Restaurant KL': {'star': '4.3',
'delivery': 'Free',
'characters': ['Pizza'],
'link': '/chain/ck3sy/viapre-italian-restaurant-kl'},
Total size is 98.7 MB.
Available foods for each restaurant.
{'La Risata Bar Pizzeria Ristorante': {'star': '4.5',
'delivery': 'Free',
'characters': ['Meat', 'Pasta', 'Salad', 'Pizza', 'Italian'],
'link': '/chain/ce3iw/la-risata-bar-pizzeria-ristorante',
'data': []},
'Viapre Italian Restaurant KL': {'star': '4.3',
'delivery': 'Free',
'characters': ['Pizza'],
'link': '/chain/ck3sy/viapre-italian-restaurant-kl',
'data': [['Starters',
{'is_half_type_available': False,
'id': 639959,
'name': 'Bresaola',
'code': 'm4yz-pr-dpsn',
'description': 'Air dry beef loin slices on fresh mozzarella, mushroom pikles, evo oil and fine balsamic',
'file_path': '',
'logo_path': '',
'half_type': None,
'is_alcoholic_item': False,
'product_variations': [{'id':
Last update on 24th November 2019.
Total size is 382.1 MB.
Available foods for each restaurant.
Last update on 15th August 2020.
gamerbraves.com#
gamersantai.com#
gamersonduty.com#
gempak.com#
goody25.com#
goodymy.com#
google PDF#
Manually searched for PDFs on certain Malaysian domains, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/google-search-pdf.zip
gov.my PDF#
Manually saved as HTML from Google search results using the query site:ywm.gov.my filetype:pdf.
Malaysia Hansard#
Originally from https://www.parlimen.gov.my/hansard-dewan-rakyat.html?uweb=dr
Only records from 1990 up to the latest available were pulled.
download#
hardwarezone.com.sg#
hargaemas.my#
hellodoktor.com#
hijabista.com.my#
hostingmalaya.com#
hype.my#
i-fiqh#
download#
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/i-fiqh-akta.jsonl
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/pandangan-hukum.jsonl
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/pandangan-pakar.jsonl
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/soalan-jawab-hukum.jsonl
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/artikel.jsonl
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/garis-panduan.jsonl
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/myhadith.islam.gov.my.jsonl
ideasaham.my#
IIUM Confession#
ilifepost.com#
imetech.com.my#
infopelajar.my#
intraday.my#
Ipendidikan#
Crawled https://www.ipendidikan.my/ to collect karangan (Malay essays).
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Crawling Ipendidikan},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/ipendidikan}}
}
Iproperty#
download#
sales all-residential, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/sales-residential.zip
rents all-residential, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/rents-residential.zip
sales all-commercial, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/sales-commercial.zip
rents all-commercial, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/rents-commercial.zip
how-to-read#
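# Unzip the archive first, e.g. in a notebook: !unzip sales-residential.zip
import json
from glob import glob

# Collect the extracted JSON files and load the first one.
files = glob('sales-residential/*.json')
with open(files[0]) as fopen:
    data = json.load(fopen)

# Print the first listing in that file.
print(data['listings']['items'][0])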
{'prices': [{'type': 'sale', 'currency': 'MYR', 'min': 899000, 'max': 899000}],
'medias': [{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg',
'mimeType': 'image/jpeg'}],
'cover': {'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'mimeType': 'image/jpeg'},
'logo': {},
'address': {'formattedAddress': 'Jalan Sultan Ismail, KLCC, 50250, Kuala Lumpur',
'lat': 3.154365,
'lng': 101.707512,
'hasLatLng': True,
'hideMarker': False},
'multilanguagePlace': {'en-GB': {'level1': 'Kuala Lumpur',
'level2': 'KLCC',
'level3': 'Vortex'},
'ms-MY': {'level1': 'Kuala Lumpur', 'level2': 'KLCC', 'level3': 'Vortex'}},
'attributes': {'builtUp': '826',
'furnishing': 'Partly Furnished',
'landTitleType': 'Unknown',
'tenure': 'Freehold',
'facingDirection': 'Unknown',
'occupancy': 'Vacant',
'titleType': 'Strata',
'sizeUnit': 'SQUARE_FEET',
'sizeUnitLandArea': 'SQUARE_FEET',
'downloadUrl': 'http://generator.iproperty.com.my/property/generate_pdf.aspx?pid=JI-weovckV81',
'buildingId': 3879},
'organisations': [{'id': '1669',
'type': 'agency',
'name': 'Vivahomes Realty - Subang Jaya',
'logo': {'type': 'image',
'url': 'https://images-my.ippstatic.com/images/searchresult/agencybrandlogo/c73a67c8ab304a07b6475c23159bae33.png',
'thumbnailUrl': 'https://images-my.ippstatic.com/images/searchresult/agencybrandlogo/c73a67c8ab304a07b6475c23159bae33.png',
'mimeType': 'image/jpeg'},
'color': '#80bc00',
'contact': {'phones': [{'number': '+60380811688', 'label': 'phone'},
{'number': ' 60380243288', 'label': 'fax'}]}}],
'listers': [{'id': '10460',
'type': 'agent',
'name': 'Victor',
'license': 'REN 11115',
'website': 'https://www.iproperty.com.my/property-agent/victor-10460',
'image': {'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg',
'mimeType': 'image/jpeg'},
'contact': {'phones': [{'number': '+60132872856', 'label': 'mobile'},
{'number': '+60132872856', 'label': 'whatsapp'}],
'emails': ['vistaera@gmail.com']}}],
'active': True,
'isPrimary': False,
'channels': ['sale'],
'id': 'sale-7995132',
'kind': 'property',
'shareLink': 'https://www.iproperty.com.my/property/klcc/vortex/sale-7995132/',
'title': 'Vortex, KLCC',
'description': "VORTEX KLCC \r\nSize : 826 sq ft \r\n2 bedrooms 2 batrooms + 1 study room \r\nmiddle floor \r\nrenovated and full furnished \r\n\r\n\r\n** good deal, below market value\r\n\r\nVortex KLCC is a newly completed residences by Monoland which lies in the heart of Golden Triangle of KL. It is also a new iconic curvy round-shaped highrise building which totally bring a new breath to the skyline. Vortex is surrounded by corporate office buildings, luxury hotels and famous shopping malls. \r\n- Shangri-La hotel is right opposite Vortex KLCC \r\n- KLCC shopping mall and Pavilion mall is walking distance from Vortex KLCC \r\n- KL Tower is 3 minutes drive away from Vortex KLCC \r\n- Bukit Nanas Monorail Station is walking distance away from Vortex KLCC \r\n\r\nVortex is a freehold serviced apartment located at Jalan Sultan Ismail, at the heart of KL City. This serviced apartment is 58-storeys in height with 248 units in total. The serviced apartment's unit size starts from 744 sq.ft. \r\n\r\nThe facilities available at Vortex are clubhouse, gymnasium, lap pool, Alfresco lounge, water features, timber deck, sun lounge, steam room, sauna, chillout music pool bar and health spa. \r\n\r\nConsidered one of the best located serviced apartments nearby KLCC, Vortex is just minutes drive to Suria KLCC Shopping Centre and Pavilion Mall, all within 10-15 minutes. \r\n\r\n# contact agent Victor 013-2872856",
'tier': 3,
'isPremiumPlus': False,
'propertyType': 'Serviced Residence',
'updatedAt': '2020-06-03T05:22:00Z',
'postedAt': '2020-06-03T05:22:00Z',
'referenceCode': 'UP7995132',
'channel': 'sale',
'isSA': False}
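The sample record above is deeply nested, so a flat summary is often easier to work with. A minimal sketch that reads only keys visible in the sample (`prices`, `address`, `attributes`, `propertyType`, `id`); `summarise_listing` is a hypothetical helper, and other listings may lack some of these keys, which is why `.get` is used throughout.

```python
from typing import Any, Dict


def summarise_listing(item: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a few top-level facts out of one listing dict."""
    prices = item.get("prices") or [{}]
    address = item.get("address") or {}
    attributes = item.get("attributes") or {}
    return {
        "id": item.get("id"),
        "property_type": item.get("propertyType"),
        "currency": prices[0].get("currency"),
        "min_price": prices[0].get("min"),
        "address": address.get("formattedAddress"),
        "lat": address.get("lat"),
        "lng": address.get("lng"),
        "built_up": attributes.get("builtUp"),
        "tenure": attributes.get("tenure"),
    }


# Trimmed copy of the sample record shown above.
sample = {
    "prices": [{"type": "sale", "currency": "MYR", "min": 899000, "max": 899000}],
    "address": {
        "formattedAddress": "Jalan Sultan Ismail, KLCC, 50250, Kuala Lumpur",
        "lat": 3.154365,
        "lng": 101.707512,
    },
    "attributes": {"builtUp": "826", "tenure": "Freehold"},
    "id": "sale-7995132",
    "propertyType": "Serviced Residence",
}
summary = summarise_listing(sample)
print(summary["min_price"], summary["address"])
```

Note that `builtUp` is stored as a string in the sample, so it needs converting before any numeric comparison.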
isaham.my#
isterisihat.com.my#
jbtalks.cc#
jomgaming.my#
https://lamanweb.dbp.gov.my/jurnal/#
Jurnal Dewan Bahasa dan Pustaka#
Consists of 4 journals: Jurnal Bahasa, Jurnal Kanun, Jurnal Melayu, and Jurnal Malay Literature.
Total articles: 937
Managed to scrape: 930 articles
kakuchopurei.com#
kamusbm.com#
karangan.net#
Crawled https://karangan.net/ to collect karangan (Malay essays).
Citation#
@misc{Malay-Dataset,
note = {We gather Bahasa Malaysia corpus!, Crawling karangan.net},
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/karangan.net}}
}
kaskus.co.id#
Originally from https://huggingface.co/acul3
kebuna.com#
keluarga.my#
kimchidaily.my#
kisahdunia.com#
klgadgetguy.com#
Klook#
kopiandproperty.com#
Kosmo#
Added by https://github.com/tnwei
http://latihan-bm.blogspot.com/#
download#
TLDR#
website: leaazleeya
num. of webpages: 544
num. of webpages scraped: 544
num. articles successfully extracted: 534
remaining webpages to be scraped: 0
link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-leaazleeya
lipstiq.com#
litefinance.org#
lobakmerah.com#
lom.agc.gov.my#
Originally from https://lom.agc.gov.my/
Lowyat#
Lyrics.my#
Crawl from https://www.lyrics.my/
Download#
https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_english.json
https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_indonesia.json
https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_malay.json
https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_nasyid.json
https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_others.json
madreshoy.com#
mahersaham.com#
majalah.com#
majalahpama.my#
majoriti.com.my#
makanbola.com#
makkalosai.com.my#
maksudperibahasa.com#
maktabahalbakri.com#
malaykord.com#
Malaymail#
download#
https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl00.splitted
https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl01.splitted
https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl02.splitted
https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl03.splitted
https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl04.splitted
https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl05.splitted
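The `.splitted` suffix and the `00` to `05` numbering suggest one large JSONL file cut into ordered parts. A minimal sketch for reassembling them, assuming that plain byte-wise concatenation in name order restores the original file (this is not confirmed by the source); `join_parts` is a hypothetical helper.

```python
import shutil
from pathlib import Path
from typing import Iterable


def join_parts(parts: Iterable[Path], output: Path) -> Path:
    """Concatenate part files byte-for-byte, in sorted name order."""
    with open(output, "wb") as out:
        for part in sorted(parts):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return output
```

With the six parts downloaded into the current directory, a call such as `join_parts(Path(".").glob("malaymail.jsonl*.splitted"), Path("malaymail.jsonl"))` would write the combined file; the glob pattern does not match the output name, so the result is never read back in as a part.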
language: ms, en, zh, ta, ar
Malaysian textbooks for primary and secondary schools
Primary school textbook: KSSR
Secondary school textbook: KSSM
Link to dataset on Huggingface
malaysia-today.net#
malaysia.tamilheritage.org#
malaysiaindru.my#
malaysianow.com#
malaysiastock.biz#
malaysiatamilkalvi.com#
maskulin.com.my#
Description#
Website: mat-gaming
Number of pages: 6
Number of pages scraped: 6
Remaining pages: 0
Number of articles: 49
maukerja.my#
mcp.anu.edu.au#
mediahiburan.my#
mingguanwanita.com.my#
Description#
Website: motor-malaya
Number of pages: 943
Number of pages scraped: 943
Remaining pages: 0
Number of articles (at about 12 articles per page): 11,000
HuggingFace, https://huggingface.co/datasets/Ammar-Azman/crawl-motormalaya
Progress#
[x] Articles, pages 1-10
[x] Articles, pages 11-20
[x] Articles, pages 21-30
[x] Articles, pages 31-40
[x] Articles, pages 41-50
[x] Articles, pages 51-60
[x] Articles, pages 61-70
[x] Articles, pages 71-80
[x] Articles, pages 81-90
[x] Articles, pages 91-100
[x] Articles, pages 100-200
[x] Articles, pages 200-300
[x] Articles, pages 300-400
[x] Articles, pages 500-700
[x] Articles, pages 700-943
Status#
Complete
https://www.motomalaysia.com/#
Synthetic visual chat instructions for https://www.motomalaysia.com/
muftiwp.gov.my#
murai.my#
Website snapshot#
how-to#
python3 run.py
This script collects all nested hrefs.
Run run.sh,
bash run.sh
This script fetches the full page for each href.
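The href-collection step can be sketched with the standard library alone. This is not the contents of run.py or run.sh, which are not shown here; `HrefCollector` and `collect_hrefs` are hypothetical names, and keeping only links on the same host is an assumption.

```python
from html.parser import HTMLParser
from typing import List, Set
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_hrefs(html: str, base_url: str) -> Set[str]:
    """Return absolute links from one page that stay on the same host."""
    parser = HrefCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = {urljoin(base_url, href) for href in parser.hrefs}
    return {url for url in links if urlparse(url).netloc == host}
```

Applying `collect_hrefs` again to each newly discovered page gives the nested links; each resulting URL can then be fetched in full by whatever HTTP client the crawl uses.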
download#
my.theasianparent.com#
myartis.com#
mycarforum.com#
mygameon.my#
mykmu.net#
mymp.my#
Originally from https://mymp.org.my/p/khairy-jamaluddin-abu-bakar
myresipi.com#
mysoalan.com#
nambikkai.com.my#
nanban.com.my#
nasilemaktech.com#
nona.my#
nurulzayani.com#
ohbulan.com#
mediahiburan.my#
ohmyhome.com#
ohsem.me#
OpenDOSM#
Originally from https://open.dosm.gov.my/
org.my PDF#
Manually saved as HTML from Google search results using the query site:org.my filetype:pdf.
orientaldaily.com.my#
parlimen.gov.my#
paultan.org#
penuntutilmu.com#
perak.org#
piston.my#
pokde.net#
productnation.co#
propcafe.net#
Scraping PropertyGuru-EN (5.58 MB)#
Link to Dataset: https://huggingface.co/datasets/HiraishinEX/propertyguru-en/tree/main
pt3online.com#
quola.my#
raiz.com.my#
realestatemy.com#
relevan.com.my#
https://resepichenom.com/#
Synthetic visual chat instructions for https://resepichenom.com/
ricebowl.my#
ringgitohringgit.com#
ringgitplus.com#
rojaklah.com#
rootofscience.com#
ruby.my#
sabahpost.net#
sabrinatajudin.com#
salary.sg#
says.com#
shahbudindotcom.net#
siakapkeli.my#
simplywall.st#
sinar.syok.my#
Sinar Harian#
Crawl from https://www.sinarharian.com.my/
Download#
https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_berita.json
https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_bisnes.json
https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_sukan.json
sinarproject#
sinchew.com.my#
siraplimau.com#
skycrapercity.com#
soalanspm.com#
stories.my#
straitstimes#
studentportal.my#
suamisihat.com.my#
sukanz.com#
sunahsukasakura.com#
tamilmurasu.com.sg#
tantannews.com#
tcer.my#
download#
pdf files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my-pdf.zip
pdf files to text, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my-pdf.jsonl
website, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my.jsonl
tech-critter.com#
techlagi.my#
tekkaus.com#
teratotech.com#
theborneopost.com#
TLDR#
website: theedgemalaysia
num. of webpages scraped: 432,374 (inclusive of articles with no text)
link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-theedgemalaysia
last date of scraping: 14th August 2023
status: complete
Note#
The “language” column in the dataset contains errors: articles in Mandarin are miscategorized. This is primarily because I assigned the language by searching for the string “English version” in the text. Account for this if the language of each article matters for your use.
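One way to work around the mislabelled rows is to check for Chinese characters directly instead of trusting the column. A minimal sketch, assuming the article body is available as a plain string and that a noticeable share of CJK ideographs indicates Mandarin; `cjk_ratio`, `looks_mandarin`, and the 0.2 threshold are hypothetical illustrations, not tuned values.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def looks_mandarin(text: str, threshold: float = 0.2) -> bool:
    """Heuristic flag; the threshold is an assumption."""
    return cjk_ratio(text) >= threshold
```

Rows flagged by `looks_mandarin` can then be relabelled or excluded, depending on what the downstream task needs.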
Methodology#
Each article on The Edge Malaysia seems to have a unique numeric ID at the end of its URL, e.g. “677590” in “https://theedgemalaysia.com/node/677590”. Since we can’t enumerate articles by month, page number, etc., we use a brute-force approach that tries every candidate ID and scrapes only those that resolve to a valid URL.
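The brute-force scan over node IDs can be sketched as follows. This is an illustration, not the author's scraper: `node_url`, `scan_ids`, and the injected `fetch_status` callable are hypothetical, and treating HTTP 200 as "valid" is an assumption.

```python
from typing import Callable, Iterator, Tuple

NODE_URL = "https://theedgemalaysia.com/node/{}"


def node_url(node_id: int) -> str:
    """Build the article URL for one candidate ID."""
    return NODE_URL.format(node_id)


def scan_ids(
    start: int,
    stop: int,
    fetch_status: Callable[[str], int],
) -> Iterator[Tuple[int, str]]:
    """Yield (id, url) for every ID in [start, stop) whose URL returns 200."""
    for node_id in range(start, stop):
        url = node_url(node_id)
        if fetch_status(url) == 200:
            yield node_id, url
```

Passing the status check in as a callable keeps the loop independent of any particular HTTP client and makes it easy to add delays or retries there. Splitting the ID range into consecutive `start`/`stop` windows is one natural way to run such a scan in batches.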
Progress#
[x] batch1
[x] batch2
[x] batch3
[x] batch4
[x] batch5
[x] batch6
[x] batch7
[x] batch8
[x] batch9
[x] batch10
[x] batch11
[x] batch12
[x] batch13
[x] batch14
thekapital.my#
The Malaysian Insights#
therakyatpost.com#
therooftalks.com#
Ticket2U#
license: apache-2.0
language: en
TLDR#
website: timchew
num. of webpages scraped: 839
link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-timchew/resolve/main/timchew-scraped-data-839-webpages.jsonl
date of scraping: 10th September 2023
tryandreview.com#
tvpertiwi.com.my#
umminani.com#
umpan.com.my#
vanakkammalaysia.com.my#
varnam.my#
download#
license: apache-2.0
language: ta
website: Vikatan-MY
num. of webpages scraped: 65 (7 locked behind paywall)
link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-vikatan-my/resolve/main/vikatan-my-scraped-data.jsonl
date of scraping: 21st October 2023
contributed to: https://github.com/mesolitica/malaysian-dataset
viralcham.com#
vocket.com#
vpsmalaysia.com.my#
https://wapcar.my/#
Synthetic visual chat instructions for https://wapcar.my/
wapcar.my#
wiser.my#
Youbaby#
Crawled from https://youbaby.my/blog/
Project issue: https://github.com/users/huseinzol05/projects/1/views/1?pane=issue&itemId=33632219