crawl
=====
1Media.My
---------
Crawled from https://www.1media.my/. Related project item: https://github.com/users/huseinzol05/projects/1/views/1?filterQuery=1m&pane=issue&itemId=38028189
Download
~~~~~~~~
1. https://huggingface.co/datasets/malaysia-ai/1media.my/resolve/main/1media.my-02.json
2. https://huggingface.co/datasets/malaysia-ai/1media.my/resolve/main/1media.my.json
Working Notebook
~~~~~~~~~~~~~~~~
1. https://jupyter.app.mesolitica.com/tree/za/scrapping/1Media.My
9shares.my
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-9shares/resolve/main/9shares.jsonl
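Most downloads in this list are ``.jsonl`` files, which hold one JSON object per line. A minimal sketch for reading one after downloading it is given here; ``read_jsonl`` is a hypothetical helper name, and the keys inside each record are not documented in this list, so inspect them before relying on any field.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fopen:
        for line in fopen:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)
```

For example, ``next(read_jsonl('9shares.jsonl')).keys()`` shows the fields of the first record without loading the whole file into memory.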
Academia.edu
------------
download
~~~~~~~~
1. 15-09-2020, pdf files, https://f000.backblazeb2.com/file/malay-dataset/crawler/academia/academia.edu-v2.zip
2. extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/academia-dedup.jsonl
https://agbrief.com/news/malaysia
---------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/syafie-nzm/crawl-agbrief.com/resolve/main/agbrief-1.jsonl
https://www.agendadaily.com/
----------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-agendadaily/resolve/main/agendadaily.jsonl
https://www.akademisains.gov.my/asmsj/published-articles/
---------------------------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/akademisains.gov.my.jsonl
https://akuislam.com/
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/akuislam.com.jsonl
https://alhijrahnews.com/
-------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-malaysian-website/resolve/main/alhijrahnews-articles.jsonl
amanz.my
--------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-amanz-my/resolve/main/everything.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-amanz-my/resolve/main/parsed.jsonl
Scrape Angkasfera (798 kB)
--------------------------
Link to Dataset Repository: https://huggingface.co/datasets/hazmannaim/angkasfera_text
https://www.apu.edu.my/
-----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/CarrotzRule123/crawl-apu.edu/resolve/main/apu.edu.jsonl
article.poliklinikazzaara.com.my
--------------------------------
asklegal.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/asklegal_articles.csv
AstroAwani
----------
**Copyright of the data remains with its original owners; do not use this data for commercial purposes.**
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-bisnes.json.nested
2. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-dunia.json.nested
3. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-hiburan.json.nested
4. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-malaysia.json.nested
5. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-politik.json.nested
6. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-sukan.json.nested
7. https://huggingface.co/datasets/mesolitica/crawl-astroawani/raw/main/berita-teknologi.json.nested
8. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/gaya-hidup.json.nested
9. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-english.json
Citation
~~~~~~~~
.. code:: bibtex

   @misc{Malay-Dataset,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     note = {We gather Bahasa Malaysia corpus! Crawling AstroAwani},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/astroawani}}
   }
autobuzz.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/autobuzz.my.jsonl
azhafizah.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/azhafizah.com.jsonl
b.cari.com.my
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-b-cari-com-my/resolve/main/posts.parquet
beautifulnara.com
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/beautifulnara.com.jsonl
https://berita.rtm.gov.my/
--------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-berita-rtm/resolve/main/berita-rtm.jsonl
https://bernama.com/tam/
------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/bernama.com-tam.jsonl
Bernama
-------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-bernama/resolve/main/parse-bernama.json
2. alternatively, pickled requests objects, https://huggingface.co/datasets/mesolitica/crawl-bernama/resolve/main/bernama.pkl
bikesrepublic
-------------
license: apache-2.0, language: en
TLDR
~~~~
- website: bikesrepublic
- num. of webpages scraped: 6,969
- link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-bikesrepublic/resolve/main/bikesrepublic-scraped-data-fixed-6969-webpages.jsonl
- date of scraping: 10th September 2023
- pull request: https://github.com/huseinzol05/malaysian-dataset/pull/291
bjak.my
-------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/bjak.my.jsonl
blog.fincrew.my
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blog.fincrew.my.jsonl
blog.limkitsiang.com
--------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/blog.limkitsiang.com.jsonl
blog.malaysia-asia.my
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blog.malaysia-asia.my.jsonl
blog.pandai.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blog.pandai.com.jsonl
blog.yeahhost.com.my
--------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/blog.yeahhost.com.my.jsonl
blogmalaysia.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blogmalaysia.com.jsonl
blogtipskerjaya.net
-------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blogtipskerjaya.net.jsonl
Buku teks
---------
Originally from https://www.ipendidikan.my/buku-teks-digital-kssr-tahun-1-hingga-6.html and https://www.ipendidikan.my/koleksi-buku-teks-digital-asas-kssm.html
download
~~~~~~~~
1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buku-teks.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buku-teks.jsonl
buletinmutiara.com
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buletinmutiara.com.jsonl
bullishbursa.blogspot.com
-------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/bullishbursa.blogspot.com.jsonl
bumigemilang.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/bumigemilang.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/bumigemilang.com-pdf.jsonl
bumiinvest20.home.blog
----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/bumiinvest20.home.blog.jsonl
buro247.my
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buro247.my.jsonl
c.cari.com.my
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-c-cari-com-my/resolve/main/everything.jsonl
Carigold
--------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/politics.json
2. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/current-issues.json
3. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/santai-others.json
4. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/everything.jsonl
5. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/posts.parquet
carlist.my
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/carlist.my.jsonl
carsifu.my
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/carsifu.my.jsonl
carsome.my
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/carsome.my.jsonl
cn.cari.com.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-cn-cari-com-my/resolve/main/everything.jsonl
2. extract and dedup, https://huggingface.co/datasets/mesolitica/crawl-cn-cari-com-my/resolve/main/dedup.jsonl
columbiaasia.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/columbiaasia.com.jsonl
Crossref
--------
Search DOIs based on keywords.
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/crossref-pdf.jsonl
data.gov.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-gov.my/resolve/main/data.gov.my
denaihati.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-denaihati.com
https://www.dermatology.org.my/malaysia_journal.php
---------------------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/dermatology.org.my.jsonl
dewanbahasa.jendeladbp.my
-------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/dewanbahasa.jendeladbp.my.jsonl
discoverkl.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/discoverkl.com.jsonl
diva.my
-------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/diva.my.jsonl
doctoroncall.com.my
-------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/doctoroncall.com.my.jsonl
dotproperty.com.my
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/dotproperty.com.my.jsonl
dsf.my
------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/dsf.my.jsonl
e-khutbah
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/e-khutbah.jsonl
https://www.e-mjm.org/past_issues.html
--------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/e-mjm.org.jsonl
ecentral.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ecentral.my.jsonl
edu.my PDF
----------
Google search results for ``site:edu.my filetype:pdf`` were manually saved as HTML files.
download
~~~~~~~~
1. list of html files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/edu.my.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/edu.my.jsonl
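The saved search-result pages still have to be turned into a list of PDF URLs. The crawler's own approach is not shown here; the following is only a sketch using the standard library. ``PdfLinkParser`` and ``extract_pdf_links`` are hypothetical names, and the sketch assumes the saved HTML contains direct ``href`` links to the PDFs rather than redirect-wrapped ones.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collect unique href values of <a> tags that point to .pdf files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string or fragment when checking the extension.
        path = href.split("?")[0].split("#")[0]
        if path.lower().endswith(".pdf") and href not in self.links:
            self.links.append(href)


def extract_pdf_links(html: str) -> list:
    """Return PDF links found in one saved HTML page, in document order."""
    parser = PdfLinkParser()
    parser.feed(html)
    return parser.links
```

Calling ``extract_pdf_links(open(path, encoding='utf-8').read())`` on each file from ``edu.my.zip`` would give the candidate URLs to download.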
ekonomirakyat.com
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ekonomirakyat.com.jsonl
enanyang.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/enanyang.my.jsonl
eniraimathi.blogspot.com
------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/eniraimathi.blogspot.com.jsonl
Eprints Malaysia Universities
-----------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/eprints-um, 20GB
2. https://huggingface.co/datasets/mesolitica/eprints-uitm, 187GB
3. https://huggingface.co/datasets/mesolitica/eprints-usim, 21GB
4. https://huggingface.co/datasets/mesolitica/eprints-ums, 36GB
5. https://huggingface.co/datasets/mesolitica/eprints-ukm, 13GB
6. https://huggingface.co/datasets/mesolitica/eprints-unimas, 68GB
7. https://huggingface.co/datasets/mesolitica/eprints-usm, 35GB
8. https://huggingface.co/datasets/mesolitica/eprints-uia, 22GB
9. https://huggingface.co/datasets/mesolitica/eprints-uum, 8.7GB
10. https://huggingface.co/datasets/mesolitica/eprints-others, 1.1GB
11. extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/eprints-dedup.jsonl
12. postfilter dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/filtered-eprints-dedup.jsonl
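The "extract and dedup" files are the result of deduplicating text extracted from the PDFs. The exact procedure is not described in this list, so the block below is only an illustration of exact-match deduplication on normalised text. The ``text`` key and both function names are assumptions, not the actual pipeline.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup(records, key="text"):
    """Keep the first record for each distinct normalised text value."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(normalize(record[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Hashing keeps memory use bounded by the number of distinct documents rather than their total size; near-duplicate detection would need a different technique.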
fintechnews.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/fintechnews.my.jsonl
https://fliphtml5.com/
----------------------
Crawl of the text version of fliphtml5 PDFs, searched by keyword:
- Melayu
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-fliphtml/resolve/main/fliphtml-melayu.jsonl?download=true
FMT
---
download
~~~~~~~~
for HTML
^^^^^^^^
1. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-0.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-1.jsonl
3. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-2.jsonl
4. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-3.jsonl
5. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-4.jsonl
6. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-5.jsonl
7. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-6.jsonl
8. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-7.jsonl
9. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-8.jsonl
10. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-9.jsonl
parsed
^^^^^^
1. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/parsed-fmt.jsonl
Foodpanda
---------
download
~~~~~~~~
1. ``foodpanda-city.json``: available cities and page links.
.. code:: python
{'Kuala Lumpur': '/city/kuala-lumpur',
'Penang': '/city/bayan-baru',
'Petaling Jaya': '/city/petaling-jaya',
'Subang': '/city/puchong',
'Shah Alam': '/city/shah-alam',
'Cyberjaya': '/city/cyberjaya',
2. ``foodpanda-restaurant.json``: available restaurants for each city.
.. code:: python
{'Kuala Lumpur': {'La Risata Bar Pizzeria Ristorante': {'star': '4.5',
'delivery': 'Free',
'characters': ['Meat', 'Pasta', 'Salad', 'Pizza', 'Italian'],
'link': '/chain/ce3iw/la-risata-bar-pizzeria-ristorante'},
'Viapre Italian Restaurant KL': {'star': '4.3',
'delivery': 'Free',
'characters': ['Pizza'],
'link': '/chain/ck3sy/viapre-italian-restaurant-kl'},
3. ``foodpanda-foods-old.json``: available foods for each restaurant. Total size is 98.7 MB.
.. code:: python
{'La Risata Bar Pizzeria Ristorante': {'star': '4.5',
'delivery': 'Free',
'characters': ['Meat', 'Pasta', 'Salad', 'Pizza', 'Italian'],
'link': '/chain/ce3iw/la-risata-bar-pizzeria-ristorante',
'data': []},
'Viapre Italian Restaurant KL': {'star': '4.3',
'delivery': 'Free',
'characters': ['Pizza'],
'link': '/chain/ck3sy/viapre-italian-restaurant-kl',
'data': [['Starters',
{'is_half_type_available': False,
'id': 639959,
'name': 'Bresaola',
'code': 'm4yz-pr-dpsn',
'description': 'Air dry beef loin slices on fresh mozzarella, mushroom pikles, evo oil and fine balsamic',
'file_path': '',
'logo_path': '',
'half_type': None,
'is_alcoholic_item': False,
'product_variations': [{'id':
Last updated on 24th November 2019.
4. ``foodpanda-foods.json``: available foods for each restaurant. Total size is 382.1 MB.
Last updated on 15th August 2020.
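The restaurant file is nested as city, then restaurant name, then details. As a sketch of how that structure can be walked, the block below groups restaurants by the tags in ``characters``. The inline ``restaurants`` dict copies the two entries shown above, and ``restaurants_by_tag`` is a hypothetical helper name.

```python
from collections import defaultdict

# Two entries copied from the foodpanda-restaurant.json sample above.
restaurants = {
    "Kuala Lumpur": {
        "La Risata Bar Pizzeria Ristorante": {
            "star": "4.5",
            "delivery": "Free",
            "characters": ["Meat", "Pasta", "Salad", "Pizza", "Italian"],
            "link": "/chain/ce3iw/la-risata-bar-pizzeria-ristorante",
        },
        "Viapre Italian Restaurant KL": {
            "star": "4.3",
            "delivery": "Free",
            "characters": ["Pizza"],
            "link": "/chain/ck3sy/viapre-italian-restaurant-kl",
        },
    }
}


def restaurants_by_tag(data: dict) -> dict:
    """Map each tag in 'characters' to a list of (city, restaurant) pairs."""
    index = defaultdict(list)
    for city, places in data.items():
        for name, info in places.items():
            for tag in info.get("characters", []):
                index[tag].append((city, name))
    return dict(index)


print(restaurants_by_tag(restaurants)["Pizza"])
```

With the full file, replace the inline dict with ``json.load(open('foodpanda-restaurant.json'))``.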
fuh.my
------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/fuh.my.jsonl
gamerbraves.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gamerbraves.com.jsonl
gamersantai.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gamersantai.com.jsonl
gamersonduty.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/gamersonduty.com.jsonl
gempak.com
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gempak.com.jsonl
goody25.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/goody25.com.jsonl
goodymy.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/goodymy.com.jsonl
google PDF
----------
PDFs found by manually searching Google for certain Malaysian domains, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/google-search-pdf.zip
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/google-pdf.jsonl
gov.my PDF
----------
Google search results for ``site:ywm.gov.my filetype:pdf`` were manually saved as HTML files.
download
~~~~~~~~
1. list of html files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gov.my.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gov.my.jsonl
Malaysia Hansard
----------------
Originally from https://www.parlimen.gov.my/hansard-dewan-rakyat.html?uweb=dr
Only records from 1990 up to the latest available were pulled.
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-malaysian-hansard/resolve/main/hansard.jsonl
harakahdaily
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/ms-news-harakahdaily
hardwarezone.com.sg
-------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-hardwarezone-sg/resolve/main/everything.jsonl
hargaemas.my
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/hargaemas.my.jsonl
hellodoktor.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ubat-hellodoktor.com.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/hellodoktor.com.jsonl
https://www.heraldmalaysia.com/
-------------------------------
download
~~~~~~~~
1. articles, https://huggingface.co/datasets/aisyahhrazak/crawl-heraldmalaysia/resolve/main/heraldmalaysia-articles.jsonl
2. pdf, https://huggingface.co/datasets/aisyahhrazak/crawl-heraldmalaysia/resolve/main/heraldmalaysia-pdf.jsonl
hijabista.com.my
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-hijabista/resolve/main/hijabista.jsonl
hostingmalaya.com
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/hostingmalaya.com.jsonl
hype.my
-------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/hype.my.jsonl
i-fiqh
------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/i-fiqh-akta.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/pandangan-hukum.jsonl
3. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/pandangan-pakar.jsonl
4. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/soalan-jawab-hukum.jsonl
5. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/artikel.jsonl
6. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/garis-panduan.jsonl
7. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/myhadith.islam.gov.my.jsonl
ideasaham.my
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ideasaham.my.jsonl
IIUM Confession
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-iium-confession/resolve/main/crawled-iium.json
2. https://huggingface.co/datasets/mesolitica/crawl-iium-confession/raw/main/url-iium.json
https://ikram.org.my/
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-ikram/resolve/main/ikram.jsonl
ilifepost.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-ilifepost.com/resolve/main/ilifepost.jsonl
imetech.com.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/imetech.com.my.jsonl
https://www.impiana.my/
-----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-impiana/resolve/main/impiana-my.jsonl
infopelajar.my
--------------
intraday.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/intraday.my.jsonl
Ipendidikan
-----------
Crawl of https://www.ipendidikan.my/ to collect karangan (essays).
Citation
~~~~~~~~
.. code:: bibtex

   @misc{Malay-Dataset,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     note = {We gather Bahasa Malaysia corpus! Crawling Ipendidikan},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/ipendidikan}}
   }
Iproperty
---------
download
~~~~~~~~
1. sales all-residential, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/sales-residential.zip
2. rents all-residential, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/rents-residential.zip
3. sales all-commercial, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/sales-commercial.zip
4. rents all-commercial, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/rents-commercial.zip
how-to-read
~~~~~~~~~~~
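.. code:: python

   # Unzip the archive first, e.g. `unzip sales-residential.zip`.
   import json
   from glob import glob

   # Collect the extracted JSON files.
   files = glob('sales-residential/*.json')

   # Load the first file and print its first listing.
   with open(files[0]) as fopen:
       data = json.load(fopen)
   print(data['listings']['items'][0])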
.. code:: text
{'prices': [{'type': 'sale', 'currency': 'MYR', 'min': 899000, 'max': 899000}],
'medias': [{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg',
'mimeType': 'image/jpeg'},
{'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg',
'mimeType': 'image/jpeg'}],
'cover': {'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg',
'mimeType': 'image/jpeg'},
'logo': {},
'address': {'formattedAddress': 'Jalan Sultan Ismail, KLCC, 50250, Kuala Lumpur',
'lat': 3.154365,
'lng': 101.707512,
'hasLatLng': True,
'hideMarker': False},
'multilanguagePlace': {'en-GB': {'level1': 'Kuala Lumpur',
'level2': 'KLCC',
'level3': 'Vortex'},
'ms-MY': {'level1': 'Kuala Lumpur', 'level2': 'KLCC', 'level3': 'Vortex'}},
'attributes': {'builtUp': '826',
'furnishing': 'Partly Furnished',
'landTitleType': 'Unknown',
'tenure': 'Freehold',
'facingDirection': 'Unknown',
'occupancy': 'Vacant',
'titleType': 'Strata',
'sizeUnit': 'SQUARE_FEET',
'sizeUnitLandArea': 'SQUARE_FEET',
'downloadUrl': 'http://generator.iproperty.com.my/property/generate_pdf.aspx?pid=JI-weovckV81',
'buildingId': 3879},
'organisations': [{'id': '1669',
'type': 'agency',
'name': 'Vivahomes Realty - Subang Jaya',
'logo': {'type': 'image',
'url': 'https://images-my.ippstatic.com/images/searchresult/agencybrandlogo/c73a67c8ab304a07b6475c23159bae33.png',
'thumbnailUrl': 'https://images-my.ippstatic.com/images/searchresult/agencybrandlogo/c73a67c8ab304a07b6475c23159bae33.png',
'mimeType': 'image/jpeg'},
'color': '#80bc00',
'contact': {'phones': [{'number': '+60380811688', 'label': 'phone'},
{'number': ' 60380243288', 'label': 'fax'}]}}],
'listers': [{'id': '10460',
'type': 'agent',
'name': 'Victor',
'license': 'REN 11115',
'website': 'https://www.iproperty.com.my/property-agent/victor-10460',
'image': {'type': 'image',
'url': 'https://pictures-my.ippstatic.com/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg',
'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg',
'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg',
'mimeType': 'image/jpeg'},
'contact': {'phones': [{'number': '+60132872856', 'label': 'mobile'},
{'number': '+60132872856', 'label': 'whatsapp'}],
'emails': ['vistaera@gmail.com']}}],
'active': True,
'isPrimary': False,
'channels': ['sale'],
'id': 'sale-7995132',
'kind': 'property',
'shareLink': 'https://www.iproperty.com.my/property/klcc/vortex/sale-7995132/',
'title': 'Vortex, KLCC',
'description': "VORTEX KLCC \r\nSize : 826 sq ft \r\n2 bedrooms 2 batrooms + 1 study room \r\nmiddle floor \r\nrenovated and full furnished \r\n\r\n\r\n** good deal, below market value\r\n\r\nVortex KLCC is a newly completed residences by Monoland which lies in the heart of Golden Triangle of KL. It is also a new iconic curvy round-shaped highrise building which totally bring a new breath to the skyline. Vortex is surrounded by corporate office buildings, luxury hotels and famous shopping malls. \r\n- Shangri-La hotel is right opposite Vortex KLCC \r\n- KLCC shopping mall and Pavilion mall is walking distance from Vortex KLCC \r\n- KL Tower is 3 minutes drive away from Vortex KLCC \r\n- Bukit Nanas Monorail Station is walking distance away from Vortex KLCC \r\n\r\nVortex is a freehold serviced apartment located at Jalan Sultan Ismail, at the heart of KL City. This serviced apartment is 58-storeys in height with 248 units in total. The serviced apartment's unit size starts from 744 sq.ft. \r\n\r\nThe facilities available at Vortex are clubhouse, gymnasium, lap pool, Alfresco lounge, water features, timber deck, sun lounge, steam room, sauna, chillout music pool bar and health spa. \r\n\r\nConsidered one of the best located serviced apartments nearby KLCC, Vortex is just minutes drive to Suria KLCC Shopping Centre and Pavilion Mall, all within 10-15 minutes. \r\n\r\n# contact agent Victor 013-2872856",
'tier': 3,
'isPremiumPlus': False,
'propertyType': 'Serviced Residence',
'updatedAt': '2020-06-03T05:22:00Z',
'postedAt': '2020-06-03T05:22:00Z',
'referenceCode': 'UP7995132',
'channel': 'sale',
'isSA': False}
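Each listing is deeply nested, so it is often convenient to flatten the few fields needed for analysis. The sketch below picks fields that appear in the sample output above. ``summarize_listing`` is a hypothetical helper, and the inline ``sample`` is a trimmed copy of that output, not a complete record.

```python
def summarize_listing(item: dict) -> dict:
    """Flatten a listing into a small dict of commonly used fields."""
    price = item["prices"][0]
    attrs = item.get("attributes", {})
    return {
        "id": item["id"],
        "title": item["title"],
        "currency": price["currency"],
        "price_min": price["min"],
        "price_max": price["max"],
        "built_up": attrs.get("builtUp"),  # kept as the raw string value
        "size_unit": attrs.get("sizeUnit"),
        "address": item.get("address", {}).get("formattedAddress"),
    }


# Trimmed copy of the sample listing printed above.
sample = {
    "prices": [{"type": "sale", "currency": "MYR", "min": 899000, "max": 899000}],
    "address": {"formattedAddress": "Jalan Sultan Ismail, KLCC, 50250, Kuala Lumpur"},
    "attributes": {"builtUp": "826", "sizeUnit": "SQUARE_FEET", "tenure": "Freehold"},
    "id": "sale-7995132",
    "title": "Vortex, KLCC",
}

print(summarize_listing(sample))
```

Applying it to every element of ``data['listings']['items']`` across all files gives rows that can be written to CSV or loaded into a DataFrame.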
isaham.my
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/isaham.my.jsonl
https://www.islam.gov.my/ms/e-penerbitan
----------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/e-penerbitan.jsonl
https://ismaweb.net/
--------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-malaysian-website/resolve/main/ismaweb.jsonl
isterisihat.com.my
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/isterisihat.jsonl
jbtalks.cc
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-jbtalks/resolve/main/everything.jsonl
jomgaming.my
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/jomgaming.my.jsonl
https://lamanweb.dbp.gov.my/jurnal/
-----------------------------------
Jurnal Dewan Bahasa dan Pustaka
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consists of four journals: Jurnal Bahasa, Jurnal Kanun, Jurnal Melayu, and Jurnal Malay Literature.
Total articles: 937. Successfully scraped: 930.
download
~~~~~~~~
1. https://huggingface.co/datasets/syafie-nzm/crawl-jurnaldbp/resolve/main/jurnaldbp.jsonl
https://kakimuvee.net/
----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-kakimuvee/resolve/main/kakimuvee.jsonl
kakuchopurei.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kakuchopurei.com.jsonl
kamusbm.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kamusbm.jsonl
karangan.net
------------
Crawl https://karangan.net/ to collect karangan (Malay essays).
Citation
~~~~~~~~
.. code:: bibtex

   @misc{Malay-Dataset,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     note = {We gather Bahasa Malaysia corpus! Crawling karangan.net},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/karangan.net}}
   }

kaskus.co.id
------------
Originally from https://huggingface.co/acul3
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-kaskus.co.id/resolve/main/kaskus.jsonl.7z.001
2. https://huggingface.co/datasets/mesolitica/crawl-kaskus.co.id/resolve/main/kaskus.jsonl.7z.002
3. https://huggingface.co/datasets/mesolitica/crawl-kaskus.co.id/resolve/main/kaskus.jsonl.7z.003
kebuna.com
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/kebuna.com.jsonl
kebunbandar.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/kebunbandar.com.jsonl
kelabmama
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/atiqnp/crawl-kelabmama/resolve/main/data.jsonl
keluarga.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-keluarga/resolve/main/keluarga.jsonl
kimchidaily.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kimchidaily.my.jsonl
kisahdunia.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kisahdunia.com.jsonl
klgadgetguy.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/klgadgetguy.com.jsonl
Klook
-----
kopiandproperty.com
-------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kopiandproperty.com.jsonl
Kosmo
-----
Added by https://github.com/tnwei
download
~~~~~~~~
1. https://huggingface.co/datasets/tnwei/ms-newspapers
http://latihan-bm.blogspot.com/
-------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-katagandasepara.jsonl
2. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-kbsr-simpulanbahasa.jsonl
3. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-pepatahbidalan.jsonl
4. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-tahun-6.jsonl
5. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-tatabahasa.jsonl
6. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-perumpamaan.jsonl
TLDR
----
* website: `leaazleeya `__
* num. of webpages: 544
* num. of webpages scraped: 544
* num. articles successfully extracted: 534
* remaining webpages to be scraped: 0
* link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-leaazleeya
lipstiq.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/lipstiq.com.jsonl
litefinance.org
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/litefinance.org.jsonl
lobakmerah.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/lobakmerah.com.jsonl
lom.agc.gov.my
--------------
Originally from https://lom.agc.gov.my/
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-lom-agc-gov-my/tree/main
2. extract and dedup, https://huggingface.co/datasets/mesolitica/crawl-lom-agc-gov-my/resolve/main/dedup.jsonl
Lowyat
------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-lowyat
Lyrics.my
---------
Crawl from https://www.lyrics.my/
Download
~~~~~~~~
1. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_english.json
2. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_indonesia.json
3. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_malay.json
4. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_nasyid.json
5. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_others.json
madreshoy.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/madreshoy.com.jsonl
mahersaham.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/mahersaham.com.jsonl
majalah.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/amirulabu/majalah-com
majalahpama.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/majalahpama.my.jsonl
https://www.majcafe.com/
------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/majcafe.com.jsonl
majoriti.com.my
---------------
makanbola.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-makanbola/resolve/main/makanbola.jsonl
makkalosai.com.my
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/makkalosai.com.my.jsonl
maksudperibahasa.com
--------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-maksudperibahasa/resolve/main/maksudperibahasa.jsonl
maktabahalbakri.com
-------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/maktabahalbakri.com.jsonl
malaykord.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaykord.com.jsonl
Malaymail
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl00.splitted
2. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl01.splitted
3. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl02.splitted
4. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl03.splitted
5. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl04.splitted
6. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl05.splitted
Malaysia textbook
-----------------
* language: ms, en, zh, ta, ar
* Malaysian textbooks for primary and secondary school
* Primary school textbook: `KSSR `__
* Secondary school textbook: `KSSM `__
* Link to dataset on `Huggingface `__
malaysia-today.net
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-malaysia-today.net
malaysia.tamilheritage.org
--------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysia.tamilheritage.org.jsonl
malaysiaindru.my
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysiaindru.my.jsonl
malaysianow.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysianow.com.jsonl
malaysiastock.biz
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysiastock.biz.jsonl
malaysiatamilkalvi.com
----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysiatamilkalvi.com.jsonl
maskulin.com.my
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-maskulin
Description
-----------
* Website: `mat-gaming `__
* Total pages: 6
* Pages scraped: 6
* Pages remaining: 0
* Total articles: 49
maukerja.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/maukerja.my.jsonl
mcp.anu.edu.au
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mcp.anu.edu.au.jsonl
mediahiburan.my
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mediahiburan.my.jsonl
https://medmalay.com/
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-medmalay/resolve/main/medmalay.jsonl
mingguanwanita.com.my
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-mingguanwanita
https://www.mjpath.org.my/past-issue.php
----------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mjpath.org.my.jsonl
https://mjpharm.org/
--------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mjpharm.org.jsonl
https://www.morthoj.org/
------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/morthoj.org.jsonl
Description
-----------
* Website: `motor-malaya `__
* Total pages: 943
* Pages scraped: 943
* Pages remaining: 0
* Total articles (at about 12 articles per page): 11,000
* HuggingFace, https://huggingface.co/datasets/Ammar-Azman/crawl-motormalaya
Progress
--------
* [x] Articles on pages 1-10
* [x] Articles on pages 11-20
* [x] Articles on pages 21-30
* [x] Articles on pages 31-40
* [x] Articles on pages 41-50
* [x] Articles on pages 51-60
* [x] Articles on pages 61-70
* [x] Articles on pages 71-80
* [x] Articles on pages 81-90
* [x] Articles on pages 91-100
* [x] Articles on pages 100-200
* [x] Articles on pages 200-300
* [x] Articles on pages 300-400
* [x] Articles on pages 500-700
* [x] Articles on pages 700-943
Status
------
* Complete
https://www.motomalaysia.com/
-----------------------------
Synthetic visual chat instructions for https://www.motomalaysia.com/
download
~~~~~~~~
1. https://huggingface.co/datasets/malaysia-ai/motomalaysia.com-multiturn/blob/main/motomalaysia-data.jsonl
2. https://huggingface.co/datasets/malaysia-ai/motomalaysia.com-multiturn/blob/main/pic.zip
https://www.mps.org.my/index.cfm
--------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mps.org.my.jsonl
https://www.msss.com.my/mjss/
-----------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/msss.com.my.jsonl
mstar.com.my
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/mstar.com.my.jsonl
Mufti Negeri Sembilan
---------------------
Dataset link
~~~~~~~~~~~~
- https://huggingface.co/datasets/Ammar-Azman/crawl-mufti-negeri-sembilan/resolve/main/mufti_negeri_sem_artikel.jsonl
Mufti Pahang
------------
Dataset link
~~~~~~~~~~~~
- https://huggingface.co/datasets/Ammar-Azman/crawl-mufti-pahang/resolve/main/mufti_pahang_artikel.jsonl
Mufti Perlis
------------
Dataset link
~~~~~~~~~~~~
- https://huggingface.co/datasets/Ammar-Azman/crawl-mufti-perlis/resolve/main/mufti_perlis_artikel.jsonl
Mufti Wilayah
-------------
Link to dataset
~~~~~~~~~~~~~~~
- https://huggingface.co/datasets/Ammar-Azman/mufti_wilayah
muftiwp.gov.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/muftiwp.gov.my.jsonl
murai.my
--------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/murai.my.jsonl
Website snapshot
----------------
how-to
~~~~~~
1. Put the necessary URLs in `list.txt `__.
2. Run `run.py `__ to collect all nested hrefs:

   .. code:: bash

      python3 run.py

3. Run `run.sh `__ to fetch the full page for each href:

   .. code:: bash

      bash run.sh

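The href-collection step can be sketched as below. This is a minimal illustration, not the contents of ``run.py``: the names ``HrefCollector`` and ``same_site_hrefs`` are hypothetical, and restricting links to the same host is an assumption about what "nested" means here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.hrefs.append(urljoin(self.base_url, value))


def same_site_hrefs(base_url, html):
    """Return unique links in `html` that stay on the host of `base_url`."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    unique = []
    for href in collector.hrefs:
        # Assumption: only links on the same host count as "nested".
        if urlparse(href).netloc == host and href not in unique:
            unique.append(href)
    return unique


page = '<a href="/a">A</a> <a href="https://other.example/x">X</a> <a href="/a">A again</a>'
print(same_site_hrefs("https://example.com/", page))
```

Each URL in ``list.txt`` would be fetched and passed through a function like this, and the resulting links handed to the page-fetching step.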
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website
2. dedup based on 428982 URLs, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/snapshot.jsonl
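URL-based deduplication of a JSONL file can be sketched as follows. This is a minimal sketch under an assumption: each record is taken to carry a ``url`` field, which is a hypothetical field name because the schema of ``snapshot.jsonl`` is not documented here. The first record seen for each URL is kept.

```python
import json


def dedup_by_url(lines, key="url"):
    """Yield parsed JSONL records, keeping only the first one per URL.

    `key` is an assumed field name. Records that lack it all share the
    value None, so only the first such record is kept.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        url = record.get(key)
        if url in seen:
            continue
        seen.add(url)
        yield record
```

Because the function takes any iterable of lines, an open file handle can be passed directly, so the whole file never has to be loaded into memory.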
my.theasianparent.com
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/my.theasianparent.com.jsonl
myartis.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myartis.com.jsonl
mycarforum.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-mycarforum-com/resolve/main/everything.jsonl
mygameon.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mygameon.my.jsonl
https://myjgeosc.com/
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjgeosc.com.jsonl
https://myjms.mohe.gov.my/
--------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjms.mohe.gov.my.jsonl
https://myjsustainagri.com/
---------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjsustainagri.com.jsonl
mykmu.net
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mykmu.net.jsonl
mymp.my
-------
Originally from https://mymp.org.my/p/khairy-jamaluddin-abu-bakar
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-mymp.my/resolve/main/mymp.pkl
myresipi.com
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/myresipi.com.jsonl
mysoalan.com
------------
download
~~~~~~~~
1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mysoalan.com-pdf.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mysoalan.com.jsonl
nambikkai.com.my
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/nambikkai.com.my.jsonl
nanban.com.my
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/nanban.com.my.jsonl
nasilemaktech.com
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/nasilemaktech.com.jsonl
https://www.newera.edu.my/publication.php?id=4805&pub=mjcs
----------------------------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/newera.edu.my.jsonl
https://news.seehua.com/
------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-news.seehua/resolve/main/seehua.jsonl
https://nextrift.com/
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/CarrotzRule123/crawl-nextrift/resolve/main/nextrift.jsonl
nona.my
-------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-nona
nurulzayani.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/nurulzayani.com.jsonl
https://nutriweb.org.my/mjn/online-first.php
--------------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/nutriweb.org.my.jsonl
ohbulan.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ohbulan.com.jsonl
ohmedia.my
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ohmedia.my.jsonl
ohmyhome.com
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ohmyhome.com.jsonl
ohsem.me
--------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ohsem.me.jsonl
OpenDOSM
--------
Originally from https://open.dosm.gov.my/
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-opendosm/tree/main
org.my PDF
----------
Search result pages were manually saved as HTML from Google Search using the query ``site:org.my filetype:pdf``.
download
~~~~~~~~
1. list of html files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/org.my.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/org.my.jsonl
orientaldaily.com.my
--------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/orientaldaily.com.my.jsonl
parlimen.gov.my
---------------
download
~~~~~~~~
1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-parlimen-gov-my/tree/main
2. extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/parlimen-gov-dedup.jsonl
paultan.org
-----------
download
~~~~~~~~
1. BM, https://huggingface.co/datasets/farhanhelmy/paultan-bm
pdfdrive
--------
Originally from https://twitter.com/acul_SR
download
~~~~~~~~
1. extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/pdfdrive-dedup.jsonl
penuntutilmu.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/penuntutilmu.com.jsonl
perak.org
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-perak-org/resolve/main/everything.jsonl
https://www.pgm-my.org/malaysianjournalofgenetics/
--------------------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/pgm-my.org.jsonl
piston.my
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/piston.my.jsonl
pokde.net
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/pokde.net.jsonl
productnation.co
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/productnation.co.json
propcafe.net
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/propcafe.net.jsonl
Scraping PropertyGuru-EN (5.58 MB)
----------------------------------
Link to Dataset: https://huggingface.co/datasets/HiraishinEX/propertyguru-en/tree/main
pt3online.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-pt3online.jsonl
quola.my
--------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/quola.my.jsonl
raiz.com.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/raiz.com.my.jsonl
realestatemy.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/realestatemy.com.jsonl
relevan.com.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-relevan.com.my
https://resepichenom.com/
-------------------------
Synthetic visual chat instructions for https://resepichenom.com/
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/resepichenom.com-multiturn/resolve/main/chat.json
2. https://huggingface.co/datasets/mesolitica/resepichenom.com-multiturn/resolve/main/pic.zip
ricebowl.my
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ricebowl.my.jsonl
ringgitohringgit.com
--------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ringgitohringgit.com.jsonl
ringgitplus.com
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ringgitplus.com.jsonl
rojaklah.com
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/rojaklah.com.jsonl
rootofscience.com
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/moiralah/rootofscience/resolve/main/rootofscience.jsonl
ruby.my
-------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ruby.my.jsonl
sabahpost.net
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-sabahpost/resolve/main/sabahpost.jsonl
sabrinatajudin.com
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/sabrinatajudin.com.jsonl
salary.sg
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-salary-sg/resolve/main/everything.jsonl
says.com
--------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/says.com.jsonl
https://selangorkini.my/ta/
---------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/selangorkini.my-ta.jsonl
https://senaraiperibahasa.com/
------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset/resolve/main/senaraiperibahasa.com.jsonl
shahbudindotcom.net
-------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/shahbudindotcom.net.jsonl
Shinjiru Blog
^^^^^^^^^^^^^
- https://www.shinjiru.com.my/blog
Dataset link
^^^^^^^^^^^^
- https://huggingface.co/datasets/Ammar-Azman/shinjiru-blog/resolve/main/shinjiru_article.jsonl
siakapkeli.my
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/siakapkeli.my.jsonl
simplywall.st
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/simplywall.st.jsonl
sinar.syok.my
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/sinar.syok.my.jsonl
Sinar Harian
------------
Crawl from https://www.sinarharian.com.my/
Download
~~~~~~~~
1. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_berita.json
2. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_bisnes.json
3. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_politik.json
4. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_sukan.json
5. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_wawancara.json
sinarproject
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/politikus.json
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/govdocs.jsonl
sinchew.com.my
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/sinchew.com.my.jsonl
siraplimau.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/siraplimau.com.jsonl
skyscrapercity.com
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/skyscrapercity.com.jsonl
soalanspm.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalanspm.jsonl
2. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/spm-ayatpasif-aktif.jsonl
stories.my
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/stories.my.json
https://story.motherhood.com.my/my/
-----------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/story.motherhood.com.my
straitstimes
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/straitstimes.jsonl
studentportal.my
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/studentportal.my.jsonl
suamisihat.com.my
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/suamisihat.jsonl
https://www.suararisda.my/blog
------------------------------
sukanz.com
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-sukanz/resolve/main/sukanz.jsonl
sunahsukasakura.com
-------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/sunahsukasakura.com.jsonl
https://www.surah.my/
---------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/surah.my.jsonl
https://tamil.goodreturns.in/topic/malaysia
-------------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/syafie-nzm/crawl-tamil.goodreturns.in/resolve/main/tamilgoodreturns.jsonl
tamilmurasu.com.sg
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tamilmurasu.com.sg.jsonl
tantannews.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tantannews.com.jsonl
tcer.my
-------
download
~~~~~~~~
1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my-pdf.zip
2. pdf files to text, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my-pdf.jsonl
3. website, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my.jsonl
tech-critter.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tech-critter.com.jsonl
https://www.techinasia.com/tag/malaysia
---------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/techinasia.com.json
2. parsed, https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/techinasia.com.jsonl
techlagi.my
-----------
technave.com
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/technave.com.jsonl
techrakyat
----------
* license: apache-2.0
* language: en
* website: `techrakyat `__
* num. of webpages scraped: 220
* link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-techrakyat/resolve/main/techrakyat-scraped-data-fixed.jsonl
tekkaus.com
-----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/tekkaus.com.jsonl
teratotech.com
--------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/teratotech.com.jsonl
theborneopost.com
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-theborneopost
https://thediagnosa.com/jenis-penyakit/
---------------------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-diagnosa/resolve/main/thediagnosa.jsonl
TLDR
----
* website: `theedgemalaysia `__
* num. of webpages scraped: 432,374 (inclusive of articles with no text)
* link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-theedgemalaysia
* last date of scraping: 14th August 2023
* status: **complete**
Note
----
The **"language" column of the dataset contains errors**: it miscategorizes articles written in Mandarin. This is primarily because I detected the language by searching for the string "English version" in the text. Account for this if the language of each article matters to your use case.
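One possible way to account for this is to re-derive a language flag from the text itself. The sketch below is an assumption-laden heuristic, not part of the original scraper: the function name ``looks_mandarin`` and the 0.3 threshold are illustrative, and it only checks the share of CJK ideographs in the text.

```python
def looks_mandarin(text, threshold=0.3):
    """Return True when at least `threshold` of non-space characters are CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    # U+4E00..U+9FFF is the CJK Unified Ideographs block.
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars) >= threshold
```

Rows where this check disagrees with the "language" column are the ones worth inspecting by hand.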
Methodology
-----------
For `The Edge Malaysia `__, each article seems to have a unique ID at the end of its URL, e.g. "677590" in "https://theedgemalaysia.com/node/677590". Since the articles cannot be enumerated by month, page number, etc., we use a **brute-force** approach that tries every candidate ID and scrapes only the URLs that turn out to be valid.
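The brute-force idea can be sketched as below. This is a minimal sketch, not the original scraper: ``candidate_urls``, ``scrape_valid`` and ``fake_fetch`` are hypothetical names, the validity check is passed in as a function so the example runs without network access, and the ID range is chosen only for illustration.

```python
BASE = "https://theedgemalaysia.com/node/{}"


def candidate_urls(start, stop):
    """Yield (id, url) for every integer ID in [start, stop)."""
    for node_id in range(start, stop):
        yield node_id, BASE.format(node_id)


def scrape_valid(start, stop, fetch):
    """Try every candidate ID and keep only those that `fetch` returns text for.

    `fetch(url)` should return the page text, or None for an invalid URL
    (for example, a 404 response).
    """
    found = {}
    for node_id, url in candidate_urls(start, stop):
        text = fetch(url)
        if text is not None:
            found[node_id] = text
    return found


# Stand-in fetcher for the example: pretend only node 677590 exists.
def fake_fetch(url):
    return "article body" if url.endswith("/677590") else None


print(scrape_valid(677588, 677592, fake_fetch))  # {677590: 'article body'}
```

In a real run, ``fetch`` would wrap an HTTP client, and the ID range would presumably be split into batches like those tracked in the progress list.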
Progress
--------
- [x] batch1
- [x] batch2
- [x] batch3
- [x] batch4
- [x] batch5
- [x] batch6
- [x] batch7
- [x] batch8
- [x] batch9
- [x] batch10
- [x] batch11
- [x] batch12
- [x] batch13
- [x] batch14
thekapital.my
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-thekapital/resolve/main/thekapital.jsonl
The Malaysian Insights
----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/themalaysianinsights.jsonl
therakyatpost.com
-----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/therakyatpost.com.jsonl
therooftalks.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/therooftalks.com.jsonl
Ticket2U
--------
timchew
-------
TLDR
~~~~
- license: apache-2.0
- language: en
- website: `timchew `__
- num. of webpages scraped: 839
- link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-timchew/resolve/main/timchew-scraped-data-839-webpages.jsonl
- date of scraping: 10th September 2023
tryandreview.com
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tryandreview.com.jsonl
tvpertiwi.com.my
----------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/tvpertiwi.com.my.jsonl
umminani.com
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/umminani.com.jsonl
umpan.com.my
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-umpan
https://upsronline.com/
-----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-upsr.jsonl
utusanborneo
------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/ms-news-utusanborneo
vanakkammalaysia.com.my
-----------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/vanakkammalaysia.com.my.jsonl
varnam.my
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/varnam.my.jsonl
Vikatan-MY
----------
TLDR
~~~~
- license: apache-2.0
- language: ta
- website: `Vikatan-MY `__
- num. of webpages scraped: 65 (7 locked behind a paywall)
- link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-vikatan-my/resolve/main/vikatan-my-scraped-data.jsonl
- date of scraping: 21st October 2023
- contributed to: https://github.com/mesolitica/malaysian-dataset
viralcham.com
-------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/viralcham.com.jsonl
vocket.com
----------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/vocket.jsonl
vpsmalaysia.com.my
------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/vpsmalaysia.com.my.jsonl
https://wapcar.my/
------------------
Synthetic visual chat instructions for https://wapcar.my/
download
~~~~~~~~
1. https://huggingface.co/malaysia-ai2020/wapcar.my-multiturn/blob/main/car-data.jsonl
2. https://huggingface.co/malaysia-ai2020/wapcar.my-multiturn/blob/main/pic.zip
wapcar.my
---------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/wapcar.my.jsonl
https://wartaoriental.com/
--------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/aisyahhrazak/crawl-malaysian-website/resolve/main/wartaoriental.jsonl
Wattpad
-------
how-to
~~~~~~
1. https://f000.backblazeb2.com/file/malay-dataset/crawler/wattpad/wattpad.zip
wiser.my
--------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/wiser.my.jsonl
Youbaby
-------
Crawl from https://youbaby.my/blog/ https://github.com/users/huseinzol05/projects/1/views/1?pane=issue&itemId=33632219
Download
~~~~~~~~
1. https://huggingface.co/datasets/amzar1303/youbaby/resolve/main/youbabymy-data.json
zenthegeek.tech
---------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/zenthegeek.tech.json
zulkiflihasan.wordpress.com
---------------------------
download
~~~~~~~~
1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/zulkiflihasan.wordpress.com.jsonl