crawl
=====

1Media.My
---------

Crawl from https://www.1media.my/

https://github.com/users/huseinzol05/projects/1/views/1?filterQuery=1m&pane=issue&itemId=38028189

Download
~~~~~~~~

1. https://huggingface.co/datasets/malaysia-ai/1media.my/resolve/main/1media.my-02.json
2. https://huggingface.co/datasets/malaysia-ai/1media.my/resolve/main/1media.my.json

Working Notebook
~~~~~~~~~~~~~~~~

1. https://jupyter.app.mesolitica.com/tree/za/scrapping/1Media.My

9shares.my
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-9shares/resolve/main/9shares.jsonl

Academia.edu
------------

download
~~~~~~~~

1. 15-09-2020, pdf files, https://f000.backblazeb2.com/file/malay-dataset/crawler/academia/academia.edu-v2.zip
2. extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/academia-dedup.jsonl

https://agbrief.com/news/malaysia
---------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/syafie-nzm/crawl-agbrief.com/resolve/main/agbrief-1.jsonl

https://www.agendadaily.com/
----------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-agendadaily/resolve/main/agendadaily.jsonl

https://www.akademisains.gov.my/asmsj/published-articles/
---------------------------------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/akademisains.gov.my.jsonl

https://akuislam.com/
---------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/akuislam.com.jsonl

https://alhijrahnews.com/
-------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-malaysian-website/resolve/main/alhijrahnews-articles.jsonl

amanz.my
--------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-amanz-my/resolve/main/everything.jsonl
2. 
https://huggingface.co/datasets/mesolitica/crawl-amanz-my/resolve/main/parsed.jsonl

Scrap Angkasfera (798 kB)
-------------------------

Link to Dataset Repository: https://huggingface.co/datasets/hazmannaim/angkasfera_text

https://www.apu.edu.my//
------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/CarrotzRule123/crawl-apu.edu/resolve/main/apu.edu.jsonl

article.poliklinikazzaara.com.my
--------------------------------

asklegal.my
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/asklegal_articles.csv

AstroAwani
----------

**Copyright of the data remains with the original owners; do not use this data for commercial purposes.**

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-bisnes.json.nested
2. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-dunia.json.nested
3. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-hiburan.json.nested
4. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-malaysia.json.nested
5. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-politik.json.nested
6. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-sukan.json.nested
7. https://huggingface.co/datasets/mesolitica/crawl-astroawani/raw/main/berita-teknologi.json.nested
8. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/gaya-hidup.json.nested
9. https://huggingface.co/datasets/mesolitica/crawl-astroawani/resolve/main/berita-english.json

Citation
~~~~~~~~

.. 
code:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Crawling AstroAwani,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/astroawani}}
   }

autobuzz.my
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/autobuzz.my.jsonl

azhafizah.com
-------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/azhafizah.com.jsonl

b.cari.com.my
-------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-b-cari-com-my/resolve/main/posts.parquet

beautifulnara.com
-----------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/beautifulnara.com.jsonl

https://berita.rtm.gov.my/
--------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-berita-rtm/resolve/main/berita-rtm.jsonl

https://bernama.com/tam/
------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/bernama.com-tam.jsonl

Bernama
-------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-bernama/resolve/main/parse-bernama.json
2. pickled ``requests`` objects, if you prefer them, https://huggingface.co/datasets/mesolitica/crawl-bernama/resolve/main/bernama.pkl

bikesrepublic
-------------

TLDR
~~~~

- website: bikesrepublic
- num. of webpages scraped: 6,969
- link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-bikesrepublic/resolve/main/bikesrepublic-scraped-data-fixed-6969-webpages.jsonl
- date of scraping: 10th September 2023
- pull request: https://github.com/huseinzol05/malaysian-dataset/pull/291

bjak.my
-------

download
~~~~~~~~

1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/bjak.my.jsonl

blog.fincrew.my
---------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blog.fincrew.my.jsonl

blog.limkitsiang.com
--------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/blog.limkitsiang.com.jsonl

blog.malaysia-asia.my
---------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blog.malaysia-asia.my.jsonl

blog.pandai.com
---------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blog.pandai.com.jsonl

blog.yeahhost.com.my
--------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/blog.yeahhost.com.my.jsonl

blogmalaysia.com
----------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blogmalaysia.com.jsonl

blogtipskerjaya.net
-------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/blogtipskerjaya.net.jsonl

Buku teks
---------

Originally from https://www.ipendidikan.my/buku-teks-digital-kssr-tahun-1-hingga-6.html and https://www.ipendidikan.my/koleksi-buku-teks-digital-asas-kssm.html

download
~~~~~~~~

1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buku-teks.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buku-teks.jsonl

buletinmutiara.com
------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buletinmutiara.com.jsonl

bullishbursa.blogspot.com
-------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/bullishbursa.blogspot.com.jsonl

bumigemilang.com
----------------

download
~~~~~~~~

1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/bumigemilang.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/bumigemilang.com-pdf.jsonl

bumiinvest20.home.blog
----------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/bumiinvest20.home.blog.jsonl

buro247.my
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/buro247.my.jsonl

c.cari.com.my
-------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-c-cari-com-my/resolve/main/everything.jsonl

Carigold
--------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/politics.json
2. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/current-issues.json
3. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/santai-others.json
4. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/everything.jsonl
5. https://huggingface.co/datasets/mesolitica/crawl-carigold/resolve/main/posts.parquet

carlist.my
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/carlist.my.jsonl

carsifu.my
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/carsifu.my.jsonl

carsome.my
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/carsome.my.jsonl

cn.cari.com.my
--------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-cn-cari-com-my/resolve/main/everything.jsonl
2. extract and dedup, https://huggingface.co/datasets/mesolitica/crawl-cn-cari-com-my/resolve/main/dedup.jsonl

columbiaasia.com
----------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/columbiaasia.com.jsonl

Crossref
--------

Search DOIs based on keywords.

download
~~~~~~~~

1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/crossref-pdf.jsonl

data.gov.my
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-gov.my/resolve/main/data.gov.my

denaihati.com
-------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-denaihati.com

https://www.dermatology.org.my/malaysia_journal.php
---------------------------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/dermatology.org.my.jsonl

dewanbahasa.jendeladbp.my
-------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/dewanbahasa.jendeladbp.my.jsonl

discoverkl.com
--------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/discoverkl.com.jsonl

diva.my
-------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/diva.my.jsonl

doctoroncall.com.my
-------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/doctoroncall.com.my.jsonl

dotproperty.com.my
------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/dotproperty.com.my.jsonl

dsf.my
------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/dsf.my.jsonl

e-khutbah
---------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/e-khutbah.jsonl

https://www.e-mjm.org/past_issues.html
--------------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/e-mjm.org.jsonl

ecentral.my
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ecentral.my.jsonl

edu.my PDF
----------

Google search results for ``site:edu.my filetype:pdf``, saved manually as HTML.

download
~~~~~~~~

1. 
list of html files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/edu.my.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/edu.my.jsonl

ekonomirakyat.com
-----------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ekonomirakyat.com.jsonl

enanyang.my
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/enanyang.my.jsonl

eniraimathi.blogspot.com
------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/eniraimathi.blogspot.com.jsonl

Eprints Malaysia Universities
-----------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/eprints-um, 20GB
2. https://huggingface.co/datasets/mesolitica/eprints-uitm, 187GB
3. https://huggingface.co/datasets/mesolitica/eprints-usim, 21GB
4. https://huggingface.co/datasets/mesolitica/eprints-ums, 36GB
5. https://huggingface.co/datasets/mesolitica/eprints-ukm, 13GB
6. https://huggingface.co/datasets/mesolitica/eprints-unimas, 68GB
7. https://huggingface.co/datasets/mesolitica/eprints-usm, 35GB
8. https://huggingface.co/datasets/mesolitica/eprints-uia, 22GB
9. https://huggingface.co/datasets/mesolitica/eprints-uum, 8.7GB
10. https://huggingface.co/datasets/mesolitica/eprints-others, 1.1GB
11. extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/eprints-dedup.jsonl
12. postfilter dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/filtered-eprints-dedup.jsonl

fintechnews.my
--------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/fintechnews.my.jsonl

https://fliphtml5.com/
----------------------

Crawl of the fliphtml5 PDF text version. Searched by keyword:

- Melayu

download
~~~~~~~~

1. 
https://huggingface.co/datasets/aisyahhrazak/crawl-fliphtml/resolve/main/fliphtml-melayu.jsonl?download=true

FMT
---

download
~~~~~~~~

for HTML
^^^^^^^^

1. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-0.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-1.jsonl
3. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-2.jsonl
4. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-3.jsonl
5. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-4.jsonl
6. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-5.jsonl
7. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-6.jsonl
8. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-7.jsonl
9. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-8.jsonl
10. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/fmt-9.jsonl

parsed
^^^^^^

1. https://huggingface.co/datasets/mesolitica/crawl-fmt/resolve/main/parsed-fmt.jsonl

Foodpanda
---------

download
~~~~~~~~

1. ``foodpanda-city.json``, available cities and page links.

   .. code:: python

      {'Kuala Lumpur': '/city/kuala-lumpur',
       'Penang': '/city/bayan-baru',
       'Petaling Jaya': '/city/petaling-jaya',
       'Subang': '/city/puchong',
       'Shah Alam': '/city/shah-alam',
       'Cyberjaya': '/city/cyberjaya',

2. ``foodpanda-restaurant.json``, available restaurants for each city.

   .. code:: python

      {'Kuala Lumpur': {'La Risata Bar Pizzeria Ristorante': {'star': '4.5',
         'delivery': 'Free',
         'characters': ['Meat', 'Pasta', 'Salad', 'Pizza', 'Italian'],
         'link': '/chain/ce3iw/la-risata-bar-pizzeria-ristorante'},
        'Viapre Italian Restaurant KL': {'star': '4.3',
         'delivery': 'Free',
         'characters': ['Pizza'],
         'link': '/chain/ck3sy/viapre-italian-restaurant-kl'},

3. ``foodpanda-foods-old.json``, total size is 98.7 MB. Available foods for each restaurant.

   .. 
code:: python

      {'La Risata Bar Pizzeria Ristorante': {'star': '4.5',
        'delivery': 'Free',
        'characters': ['Meat', 'Pasta', 'Salad', 'Pizza', 'Italian'],
        'link': '/chain/ce3iw/la-risata-bar-pizzeria-ristorante',
        'data': []},
       'Viapre Italian Restaurant KL': {'star': '4.3',
        'delivery': 'Free',
        'characters': ['Pizza'],
        'link': '/chain/ck3sy/viapre-italian-restaurant-kl',
        'data': [['Starters',
          {'is_half_type_available': False,
           'id': 639959,
           'name': 'Bresaola',
           'code': 'm4yz-pr-dpsn',
           'description': 'Air dry beef loin slices on fresh mozzarella, mushroom pikles, evo oil and fine balsamic',
           'file_path': '',
           'logo_path': '',
           'half_type': None,
           'is_alcoholic_item': False,
           'product_variations': [{'id':

   Last updated on 24th November 2019.

4. ``foodpanda-foods.json``, total size is 382.1 MB. Available foods for each restaurant. Last updated on 15th August 2020.

fuh.my
------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/fuh.my.jsonl

gamerbraves.com
---------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gamerbraves.com.jsonl

gamersantai.com
---------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gamersantai.com.jsonl

gamersonduty.com
----------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/gamersonduty.com.jsonl

gempak.com
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gempak.com.jsonl

goody25.com
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/goody25.com.jsonl

goodymy.com
-----------

download
~~~~~~~~

1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/goodymy.com.jsonl

google PDF
----------

PDFs found by manually searching certain Malaysian domains, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/google-search-pdf.zip

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/google-pdf.jsonl

gov.my PDF
----------

Google search results for ``site:ywm.gov.my filetype:pdf``, saved manually as HTML.

download
~~~~~~~~

1. list of html files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gov.my.zip
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/gov.my.jsonl

Malaysia Hansard
----------------

Originally from https://www.parlimen.gov.my/hansard-dewan-rakyat.html?uweb=dr

Only covers 1990 onward.

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-malaysian-hansard/resolve/main/hansard.jsonl

Harakahdaily
------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/ms-news-harakahdaily

hardwarezone.com.sg
-------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-hardwarezone-sg/resolve/main/everything.jsonl

hargaemas.my
------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/hargaemas.my.jsonl

hellodoktor.com
---------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ubat-hellodoktor.com.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/hellodoktor.com.jsonl

https://www.heraldmalaysia.com/
-------------------------------

download
~~~~~~~~

1. articles, https://huggingface.co/datasets/aisyahhrazak/crawl-heraldmalaysia/resolve/main/heraldmalaysia-articles.jsonl
2. pdf, https://huggingface.co/datasets/aisyahhrazak/crawl-heraldmalaysia/resolve/main/heraldmalaysia-pdf.jsonl

hijabista.com.my
----------------

download
~~~~~~~~

1. 
https://huggingface.co/datasets/aisyahhrazak/crawl-hijabista/resolve/main/hijabista.jsonl

hostingmalaya.com
-----------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/hostingmalaya.com.jsonl

hype.my
-------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/hype.my.jsonl

i-fiqh
------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/i-fiqh-akta.jsonl
2. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/pandangan-hukum.jsonl
3. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/pandangan-pakar.jsonl
4. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/soalan-jawab-hukum.jsonl
5. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/artikel.jsonl
6. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/garis-panduan.jsonl
7. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/myhadith.islam.gov.my.jsonl

ideasaham.my
------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ideasaham.my.jsonl

IIUM Confession
---------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-iium-confession/resolve/main/crawled-iium.json
2. https://huggingface.co/datasets/mesolitica/crawl-iium-confession/raw/main/url-iium.json

https://ikram.org.my/
---------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-ikram/resolve/main/ikram.jsonl

ilifepost.com
-------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-ilifepost.com/resolve/main/ilifepost.jsonl

imetech.com.my
--------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/imetech.com.my.jsonl

https://www.impiana.my/
-----------------------

download
~~~~~~~~

1. 
https://huggingface.co/datasets/aisyahhrazak/crawl-impiana/resolve/main/impiana-my.jsonl

infopelajar.my
--------------

intraday.my
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/intraday.my.jsonl

Ipendidikan
-----------

Crawl https://www.ipendidikan.my/ to get karangan.

Citation
~~~~~~~~

.. code:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Crawling Ipendidikan,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/ipendidikan}}
   }

Iproperty
---------

download
~~~~~~~~

1. sales all-residential, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/sales-residential.zip
2. rents all-residential, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/rents-residential.zip
3. sales all-commercial, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/sales-commercial.zip
4. rents all-commercial, https://f000.backblazeb2.com/file/malay-dataset/crawler/iproperty/rents-commercial.zip

how-to-read
~~~~~~~~~~~

.. code:: python

   # Unzip the archive first, e.g. in a notebook cell:
   # !unzip sales-residential.zip
   import json
   from glob import glob

   # Each JSON file holds one page of listings.
   files = glob('sales-residential/*.json')
   with open(files[0]) as fopen:
       data = json.load(fopen)

   # Print the first listing of the first file.
   print(data['listings']['items'][0])

.. 
code:: text {'prices': [{'type': 'sale', 'currency': 'MYR', 'min': 899000, 'max': 899000}], 'medias': [{'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/67b989ed26ff4ec5ba3f9a8aeb842ea7.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/535d87ad364242c6b271611d6e4728fe.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/673b01c87e6449138b3211c250a383c0.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg', 'urlTemplate': 
'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/f78c5d30d7f2484284d5acafe0b59614.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/4631b581e7e74e5599768e9fbdfd30e5.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/7e67f39e27e744a9970d7aeeba9829d8.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/cc1f2111c87e496bad8dec1a63393145.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/da909b4aa1644d958fa394ac3b97bce9.jpg', 'mimeType': 'image/jpeg'}, {'type': 'image', 'url': 
'https://pictures-my.ippstatic.com/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/3fd1843625e9459fbdb78a7c2d1318a9.jpg', 'mimeType': 'image/jpeg'}], 'cover': {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/640/10460/306bd1cdd0454c8ab73221d3a4b9fb23.jpg', 'mimeType': 'image/jpeg'}, 'logo': {}, 'address': {'formattedAddress': 'Jalan Sultan Ismail, KLCC, 50250, Kuala Lumpur', 'lat': 3.154365, 'lng': 101.707512, 'hasLatLng': True, 'hideMarker': False}, 'multilanguagePlace': {'en-GB': {'level1': 'Kuala Lumpur', 'level2': 'KLCC', 'level3': 'Vortex'}, 'ms-MY': {'level1': 'Kuala Lumpur', 'level2': 'KLCC', 'level3': 'Vortex'}}, 'attributes': {'builtUp': '826', 'furnishing': 'Partly Furnished', 'landTitleType': 'Unknown', 'tenure': 'Freehold', 'facingDirection': 'Unknown', 'occupancy': 'Vacant', 'titleType': 'Strata', 'sizeUnit': 'SQUARE_FEET', 'sizeUnitLandArea': 'SQUARE_FEET', 'downloadUrl': 'http://generator.iproperty.com.my/property/generate_pdf.aspx?pid=JI-weovckV81', 'buildingId': 3879}, 'organisations': [{'id': '1669', 'type': 'agency', 'name': 'Vivahomes Realty - Subang Jaya', 'logo': {'type': 'image', 'url': 'https://images-my.ippstatic.com/images/searchresult/agencybrandlogo/c73a67c8ab304a07b6475c23159bae33.png', 'thumbnailUrl': 'https://images-my.ippstatic.com/images/searchresult/agencybrandlogo/c73a67c8ab304a07b6475c23159bae33.png', 'mimeType': 'image/jpeg'}, 'color': '#80bc00', 'contact': {'phones': [{'number': '+60380811688', 
'label': 'phone'}, {'number': ' 60380243288', 'label': 'fax'}]}}], 'listers': [{'id': '10460', 'type': 'agent', 'name': 'Victor', 'license': 'REN 11115', 'website': 'https://www.iproperty.com.my/property-agent/victor-10460', 'image': {'type': 'image', 'url': 'https://pictures-my.ippstatic.com/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg', 'urlTemplate': 'https://img.rea-asia.com/my-subsale/premium/${width}x${height}-${scale}/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg', 'thumbnailUrl': 'https://pictures-my.ippstatic.com/realtors/images/agent/e34da466ea47467080016b98675ce96f.jpg', 'mimeType': 'image/jpeg'}, 'contact': {'phones': [{'number': '+60132872856', 'label': 'mobile'}, {'number': '+60132872856', 'label': 'whatsapp'}], 'emails': ['vistaera@gmail.com']}}], 'active': True, 'isPrimary': False, 'channels': ['sale'], 'id': 'sale-7995132', 'kind': 'property', 'shareLink': 'https://www.iproperty.com.my/property/klcc/vortex/sale-7995132/', 'title': 'Vortex, KLCC', 'description': "VORTEX KLCC \r\nSize : 826 sq ft \r\n2 bedrooms 2 batrooms + 1 study room \r\nmiddle floor \r\nrenovated and full furnished \r\n\r\n\r\n** good deal, below market value\r\n\r\nVortex KLCC is a newly completed residences by Monoland which lies in the heart of Golden Triangle of KL. It is also a new iconic curvy round-shaped highrise building which totally bring a new breath to the skyline. Vortex is surrounded by corporate office buildings, luxury hotels and famous shopping malls. \r\n- Shangri-La hotel is right opposite Vortex KLCC \r\n- KLCC shopping mall and Pavilion mall is walking distance from Vortex KLCC \r\n- KL Tower is 3 minutes drive away from Vortex KLCC \r\n- Bukit Nanas Monorail Station is walking distance away from Vortex KLCC \r\n\r\nVortex is a freehold serviced apartment located at Jalan Sultan Ismail, at the heart of KL City. This serviced apartment is 58-storeys in height with 248 units in total. 
The serviced apartment's unit size starts from 744 sq.ft. \r\n\r\nThe facilities available at Vortex are clubhouse, gymnasium, lap pool, Alfresco lounge, water features, timber deck, sun lounge, steam room, sauna, chillout music pool bar and health spa. \r\n\r\nConsidered one of the best located serviced apartments nearby KLCC, Vortex is just minutes drive to Suria KLCC Shopping Centre and Pavilion Mall, all within 10-15 minutes. \r\n\r\n# contact agent Victor 013-2872856", 'tier': 3, 'isPremiumPlus': False, 'propertyType': 'Serviced Residence', 'updatedAt': '2020-06-03T05:22:00Z', 'postedAt': '2020-06-03T05:22:00Z', 'referenceCode': 'UP7995132', 'channel': 'sale', 'isSA': False}

isaham.my
---------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/isaham.my.jsonl

https://www.islam.gov.my/ms/e-penerbitan
----------------------------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/e-penerbitan.jsonl

https://ismaweb.net/
--------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-malaysian-website/resolve/main/ismaweb.jsonl

isterisihat.com.my
------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/isterisihat.jsonl

jbtalks.cc
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-jbtalks/resolve/main/everything.jsonl

jomgaming.my
------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/jomgaming.my.jsonl

https://lamanweb.dbp.gov.my/jurnal/
-----------------------------------

Jurnal Dewan Bahasa dan Pustaka
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consists of 4 journals: Jurnal Bahasa, Jurnal Kanun, Jurnal Melayu and Jurnal Malay Literature.

- Total articles: 937
- Articles scraped: 930

download
~~~~~~~~

1. 
https://huggingface.co/datasets/syafie-nzm/crawl-jurnaldbp/resolve/main/jurnaldbp.jsonl

https://kakimuvee.net/
----------------------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-kakimuvee/resolve/main/kakimuvee.jsonl

kakuchopurei.com
----------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kakuchopurei.com.jsonl

kamusbm.com
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kamusbm.jsonl

karangan.net
------------

Crawl https://karangan.net/ to get karangan.

Citation
~~~~~~~~

.. code:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Crawling karangan.net,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malay-dataset/tree/master/crawl/karangan.net}}
   }

kaskus.co.id
------------

Originally from https://huggingface.co/acul3

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-kaskus.co.id/resolve/main/kaskus.jsonl.7z.001
2. https://huggingface.co/datasets/mesolitica/crawl-kaskus.co.id/resolve/main/kaskus.jsonl.7z.002
3. https://huggingface.co/datasets/mesolitica/crawl-kaskus.co.id/resolve/main/kaskus.jsonl.7z.003

kebuna.com
----------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/kebuna.com.jsonl

kebunbandar.com
---------------

download
~~~~~~~~

1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/kebunbandar.com.jsonl

kelabmama
---------

download
~~~~~~~~

1. https://huggingface.co/datasets/atiqnp/crawl-kelabmama/resolve/main/data.jsonl

keluarga.my
-----------

download
~~~~~~~~

1. https://huggingface.co/datasets/aisyahhrazak/crawl-keluarga/resolve/main/keluarga.jsonl

kimchidaily.my
--------------

download
~~~~~~~~

1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kimchidaily.my.jsonl kisahdunia.com -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kisahdunia.com.jsonl klgadgetguy.com --------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/klgadgetguy.com.jsonl Klook ----- kopiandproperty.com ------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/kopiandproperty.com.jsonl Kosmo ----- Added by https://github.com/tnwei download ~~~~~~~~ 1. https://huggingface.co/datasets/tnwei/ms-newspapers http://latihan-bm.blogspot.com/ ------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-katagandasepara.jsonl 2. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-kbsr-simpulanbahasa.jsonl 3. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-pepatahbidalan.jsonl 4. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-tahun-6.jsonl 5. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-tatabahasa.jsonl 6. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/latihanbm-perumpamaan.jsonl TLDR ---- * website: `leaazleeya `__ * num. of webpages: 544 * num. of webpages scraped: 544 * num. of articles successfully extracted: 534 * remaining webpages to be scraped: 0 * link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-leaazleeya lipstiq.com ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/lipstiq.com.jsonl litefinance.org --------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/litefinance.org.jsonl lobakmerah.com -------------- download ~~~~~~~~ 1.
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/lobakmerah.com.jsonl lom.agc.gov.my -------------- Originally from https://lom.agc.gov.my/ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-lom-agc-gov-my/tree/main 2. extract and dedup, https://huggingface.co/datasets/mesolitica/crawl-lom-agc-gov-my/resolve/main/dedup.jsonl Lowyat ------ download ~~~~~~~~ https://huggingface.co/datasets/mesolitica/crawl-lowyat Lyrics.my --------- Crawl from https://www.lyrics.my/ Download ~~~~~~~~ 1. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_english.json 2. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_indonesia.json 3. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_malay.json 4. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_nasyid.json 5. https://huggingface.co/datasets/amzar1303/lyrics/resolve/main/lyrics_others.json madreshoy.com ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/madreshoy.com.jsonl mahersaham.com -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/mahersaham.com.jsonl majalah.com ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/amirulabu/majalah-com majalahpama.my -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/majalahpama.my.jsonl https://www.majcafe.com/ ------------------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/majcafe.com.jsonl majoriti.com.my --------------- makanbola.com ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-makanbola/resolve/main/makanbola.jsonl makkalosai.com.my ----------------- download ~~~~~~~~ 1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/makkalosai.com.my.jsonl maksudperibahasa.com -------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-maksudperibahasa/resolve/main/maksudperibahasa.jsonl maktabahalbakri.com ------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/maktabahalbakri.com.jsonl malaykord.com ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaykord.com.jsonl Malaymail --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl00.splitted 2. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl01.splitted 3. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl02.splitted 4. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl03.splitted 5. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl04.splitted 6. https://huggingface.co/datasets/mesolitica/crawl-malaymail/resolve/main/malaymail.jsonl05.splitted -------------- language: - ms - en - zh - ta - ar -------------- * Malaysia textbook for primary and secondary school * Primary school textbook: `KSSR `__ * Secondary school textbook: `KSSM `__ * Link to dataset on `Huggingface `__ malaysia-today.net ------------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-malaysia-today.net malaysia.tamilheritage.org -------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysia.tamilheritage.org.jsonl malaysiaindru.my ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysiaindru.my.jsonl malaysianow.com --------------- download ~~~~~~~~ 1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysianow.com.jsonl malaysiastock.biz ----------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysiastock.biz.jsonl malaysiatamilkalvi.com ---------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/malaysiatamilkalvi.com.jsonl maskulin.com.my --------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-maskulin Description ----------- * Website: `mat-gaming `__ * Total pages: 6 * Pages scraped: 6 * Pages remaining: 0 * Total articles: 49 maukerja.my ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/maukerja.my.jsonl mcp.anu.edu.au -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mcp.anu.edu.au.jsonl mediahiburan.my --------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mediahiburan.my.jsonl https://medmalay.com/ --------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-medmalay/resolve/main/medmalay.jsonl mingguanwanita.com.my --------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-mingguanwanita https://www.mjpath.org.my/past-issue.php ---------------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mjpath.org.my.jsonl https://mjpharm.org/ -------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mjpharm.org.jsonl https://www.morthoj.org/ ------------------------ download ~~~~~~~~ 1.
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/morthoj.org.jsonl Description ----------- * Website: `motor-malaya `__ * Total pages: 943 * Pages scraped: 943 * Pages remaining: 0 * Total articles (at roughly 12 articles per page): 11,000 * HuggingFace, https://huggingface.co/datasets/Ammar-Azman/crawl-motormalaya Progress -------- * [x] Articles, pages 1-10 * [x] Articles, pages 11-20 * [x] Articles, pages 21-30 * [x] Articles, pages 31-40 * [x] Articles, pages 41-50 * [x] Articles, pages 51-60 * [x] Articles, pages 61-70 * [x] Articles, pages 71-80 * [x] Articles, pages 81-90 * [x] Articles, pages 91-100 * [x] Articles, pages 100-200 * [x] Articles, pages 200-300 * [x] Articles, pages 300-400 * [x] Articles, pages 500-700 * [x] Articles, pages 700-943 Status ------ * Complete https://www.motomalaysia.com/ ----------------------------- Synthetic visual chat instructions for https://www.motomalaysia.com/ download ~~~~~~~~ 1. https://huggingface.co/datasets/malaysia-ai/motomalaysia.com-multiturn/blob/main/motomalaysia-data.jsonl 2. https://huggingface.co/datasets/malaysia-ai/motomalaysia.com-multiturn/blob/main/pic.zip https://www.mps.org.my/index.cfm -------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mps.org.my.jsonl https://www.msss.com.my/mjss/ ----------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/msss.com.my.jsonl mstar.com.my ------------ download ~~~~~~~~ 1.
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/mstar.com.my.jsonl Dataset link ^^^^^^^^^^^^ - https://huggingface.co/datasets/Ammar-Azman/crawl-mufti-negeri-sembilan/resolve/main/mufti_negeri_sem_artikel.jsonl Dataset link ^^^^^^^^^^^^ - https://huggingface.co/datasets/Ammar-Azman/crawl-mufti-pahang/resolve/main/mufti_pahang_artikel.jsonl Dataset link ^^^^^^^^^^^^ - https://huggingface.co/datasets/Ammar-Azman/crawl-mufti-perlis/resolve/main/mufti_perlis_artikel.jsonl Link to dataset """"""""""""""" - https://huggingface.co/datasets/Ammar-Azman/mufti_wilayah muftiwp.gov.my -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/muftiwp.gov.my.jsonl murai.my -------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/murai.my.jsonl Website snapshot ---------------- how-to ~~~~~~ 1. Put necessary urls in `list.txt `__. 2. Run `run.py `__, .. code:: bash python3 run.py This script is to get all nested href. 3. Run `run.sh `__, .. code:: bash bash run.sh This script is to fetch full page for each href. download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website 2. dedup based on 428982 URLs, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/snapshot.jsonl my.theasianparent.com --------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/my.theasianparent.com.jsonl myartis.com ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myartis.com.jsonl mycarforum.com -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-mycarforum-com/resolve/main/everything.jsonl mygameon.my ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mygameon.my.jsonl https://myjgeosc.com/ --------------------- download ~~~~~~~~ 1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjgeosc.com.jsonl https://myjms.mohe.gov.my/ -------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjms.mohe.gov.my.jsonl https://myjsustainagri.com/ --------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/myjsustainagri.com.jsonl mykmu.net --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mykmu.net.jsonl mymp.my ------- Originally from https://mymp.org.my/p/khairy-jamaluddin-abu-bakar download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-mymp.my/resolve/main/mymp.pkl myresipi.com ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/myresipi.com.jsonl mysoalan.com ------------ download ~~~~~~~~ 1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mysoalan.com-pdf.zip 2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/mysoalan.com.jsonl nambikkai.com.my ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/nambikkai.com.my.jsonl nanban.com.my ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/nanban.com.my.jsonl nasilemaktech.com ----------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/nasilemaktech.com.jsonl https://www.newera.edu.my/publication.php?id=4805&pub=mjcs ---------------------------------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/newera.edu.my.jsonl https://news.seehua.com/ ------------------------ download ~~~~~~~~ 1. 
https://huggingface.co/datasets/aisyahhrazak/crawl-news.seehua/resolve/main/seehua.jsonl https://nextrift.com/ --------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/CarrotzRule123/crawl-nextrift/resolve/main/nextrift.jsonl nona.my ------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-nona nurulzayani.com --------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/nurulzayani.com.jsonl https://nutriweb.org.my/mjn/online-first.php -------------------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/nutriweb.org.my.jsonl ohbulan.com ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ohbulan.com.jsonl ohmedia.my ---------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ohmedia.my.jsonl ohmyhome.com ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ohmyhome.com.jsonl ohsem.me -------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ohsem.me.jsonl OpenDOSM -------- Originally from https://open.dosm.gov.my/ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-opendosm/tree/main org.my PDF ---------- Manually save to html from google search using ``site:org.my filetype:pdf``. download ~~~~~~~~ 1. list of html files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/org.my.zip 2. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/org.my.jsonl orientaldaily.com.my -------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/orientaldaily.com.my.jsonl parlimen.gov.my --------------- download ~~~~~~~~ 1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-parlimen-gov-my/tree/main 2.
extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/parlimen-gov-dedup.jsonl paultan.org ----------- download ~~~~~~~~ 1. BM, https://huggingface.co/datasets/farhanhelmy/paultan-bm pdfdrive -------- Originally from https://twitter.com/acul_SR download ~~~~~~~~ 1. extract and dedup, https://huggingface.co/datasets/mesolitica/pdf-text-dedup/resolve/main/pdfdrive-dedup.jsonl penuntutilmu.com ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/penuntutilmu.com.jsonl perak.org --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-perak-org/resolve/main/everything.jsonl https://www.pgm-my.org/malaysianjournalofgenetics/ -------------------------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/pgm-my.org.jsonl piston.my --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/piston.my.jsonl pokde.net --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/pokde.net.jsonl productnation.co ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/productnation.co.json propcafe.net ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/propcafe.net.jsonl Scraping PropertyGuru-EN (5.58 MB) ---------------------------------- Link to Dataset: https://huggingface.co/datasets/HiraishinEX/propertyguru-en/tree/main pt3online.com ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-pt3online.jsonl quola.my -------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/quola.my.jsonl raiz.com.my ----------- download ~~~~~~~~ 1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/raiz.com.my.jsonl realestatemy.com ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/realestatemy.com.jsonl relevan.com.my -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-relevan.com.my https://resepichenom.com/ ------------------------- Synthetic visual chat instructions for https://resepichenom.com/ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/resepichenom.com-multiturn/resolve/main/chat.json 2. https://huggingface.co/datasets/mesolitica/resepichenom.com-multiturn/resolve/main/pic.zip ricebowl.my ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ricebowl.my.jsonl ringgitohringgit.com -------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ringgitohringgit.com.jsonl ringgitplus.com --------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/ringgitplus.com.jsonl rojaklah.com ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/rojaklah.com.jsonl rootofscience.com ----------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/moiralah/rootofscience/resolve/main/rootofscience.jsonl ruby.my ------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/ruby.my.jsonl sabahpost.net ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-sabahpost/resolve/main/sabahpost.jsonl sabrinatajudin.com ------------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/sabrinatajudin.com.jsonl salary.sg --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-salary-sg/resolve/main/everything.jsonl says.com -------- download ~~~~~~~~ 1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/says.com.jsonl https://selangorkini.my/ta/ --------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/selangorkini.my-ta.jsonl https://senaraiperibahasa.com/ ------------------------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset/resolve/main/senaraiperibahasa.com.jsonl shahbudindotcom.net ------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/shahbudindotcom.net.jsonl Shinjiru Blog ^^^^^^^^^^^^^ - https://www.shinjiru.com.my/blog Dataset link ^^^^^^^^^^^^ - https://huggingface.co/datasets/Ammar-Azman/shinjiru-blog/resolve/main/shinjiru_article.jsonl siakapkeli.my ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/siakapkeli.my.jsonl simplywall.st ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/simplywall.st.jsonl sinar.syok.my ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/sinar.syok.my.jsonl Sinar Harian ------------ Crawl from https://www.sinarharian.com.my/ Download ~~~~~~~~ 1. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_berita.json 2. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_bisnes.json 3. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_politik.json 4. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_sukan.json 5. https://huggingface.co/datasets/amzar1303/crawl-sinar-harian/blob/main/sinar_harian_link_wawancara.json sinarproject ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/politikus.json 2. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/govdocs.jsonl sinchew.com.my -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/sinchew.com.my.jsonl siraplimau.com -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/siraplimau.com.jsonl skyscrapercity.com ------------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/skyscrapercity.com.jsonl soalanspm.com ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalanspm.jsonl 2. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/spm-ayatpasif-aktif.jsonl stories.my ---------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/stories.my.json https://story.motherhood.com.my/my/ ----------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/story.motherhood.com.my straitstimes ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/straitstimes.jsonl studentportal.my ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/studentportal.my.jsonl suamisihat.com.my ----------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/suamisihat.jsonl https://www.suararisda.my/blog ------------------------------ sukanz.com ---------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-sukanz/resolve/main/sukanz.jsonl sunahsukasakura.com ------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/sunahsukasakura.com.jsonl https://www.surah.my/ --------------------- download ~~~~~~~~ 1.
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/surah.my.jsonl https://tamil.goodreturns.in/topic/malaysia ------------------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/syafie-nzm/crawl-tamil.goodreturns.in/resolve/main/tamilgoodreturns.jsonl tamilmurasu.com.sg ------------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tamilmurasu.com.sg.jsonl tantannews.com -------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tantannews.com.jsonl tcer.my ------- download ~~~~~~~~ 1. pdf files, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my-pdf.zip 2. pdf files to text, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my-pdf.jsonl 3. website, https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tcer.my.jsonl tech-critter.com ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tech-critter.com.jsonl https://www.techinasia.com/tag/malaysia --------------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/techinasia.com.json 2. parsed, https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/techinasia.com.jsonl techlagi.my ----------- technave.com ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/technave.com.jsonl -------------- license: apache-2.0 language: - en -------------- * website: `techrakyat `__ * num. of webpages scraped: 220 * link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-techrakyat/resolve/main/techrakyat-scraped-data-fixed.jsonl tekkaus.com ----------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/tekkaus.com.jsonl teratotech.com -------------- download ~~~~~~~~ 1. 
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/teratotech.com.jsonl theborneopost.com ----------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-theborneopost https://thediagnosa.com/jenis-penyakit/ --------------------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-diagnosa/resolve/main/thediagnosa.jsonl TLDR ---- * website: `theedgemalaysia `__ * num. of webpages scraped: 432,374 (inclusive of articles with no text) * link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-theedgemalaysia * last date of scraping: 14th August 2023 * status: **complete** Note ---- The **"language" column of the dataset contains errors**: it miscategorizes articles written in Mandarin. This is primarily because I was searching for the string "English version" in the text. Account for this if the language of each article matters for your use. Methodology ----------- For `The Edge Malaysia `__, each article seems to have a unique ID at the end of its URL, e.g., "677590" in "https://theedgemalaysia.com/node/677590". Since the articles cannot be enumerated by month, page number, etc., we use a **brute force** approach that tests every candidate ID and scrapes only the URLs that turn out to be valid. Progress -------- - [x] batch1 - [x] batch2 - [x] batch3 - [x] batch4 - [x] batch5 - [x] batch6 - [x] batch7 - [x] batch8 - [x] batch9 - [x] batch10 - [x] batch11 - [x] batch12 - [x] batch13 - [x] batch14 thekapital.my ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-thekapital/resolve/main/thekapital.jsonl The Malaysian Insights ---------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/themalaysianinsights.jsonl therakyatpost.com ----------------- download ~~~~~~~~ 1.
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/therakyatpost.com.jsonl therooftalks.com ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/therooftalks.com.jsonl Ticket2U -------- -------------- license: apache-2.0 language: - en -------------- TLDR - website: `timchew `__ - num. of webpages scraped: 839 - link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-timchew/resolve/main/timchew-scraped-data-839-webpages.jsonl - date of scraping: 10th September 2023 tryandreview.com ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/tryandreview.com.jsonl tvpertiwi.com.my ---------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/tvpertiwi.com.my.jsonl umminani.com ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/umminani.com.jsonl umpan.com.my ------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-umpan https://upsronline.com/ ----------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-soalan/resolve/main/soalan-upsr.jsonl download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/ms-news-utusanborneo vanakkammalaysia.com.my ----------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/vanakkammalaysia.com.my.jsonl varnam.my --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/varnam.my.jsonl -------------- license: apache-2.0 language: - ta -------------- **TLDR** """""""" - website: `Vikatan-MY `__ - num. 
of webpages scraped: 65 (7 locked behind paywall) - link to dataset: https://huggingface.co/datasets/wanadzhar913/crawl-vikatan-my/resolve/main/vikatan-my-scraped-data.jsonl - date of scraping: 21st October 2023 - contributed to: https://github.com/mesolitica/malaysian-dataset viralcham.com ------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/viralcham.com.jsonl vocket.com ---------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/vocket.jsonl vpsmalaysia.com.my ------------------ download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/vpsmalaysia.com.my.jsonl https://wapcar.my/ ------------------ Synthetic visual chat instructions for https://wapcar.my/ download ~~~~~~~~ 1. https://huggingface.co/malaysia-ai2020/wapcar.my-multiturn/blob/main/car-data.jsonl 2. https://huggingface.co/malaysia-ai2020/wapcar.my-multiturn/blob/main/pic.zip wapcar.my --------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/raw/main/wapcar.my.jsonl https://wartaoriental.com/ -------------------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/aisyahhrazak/crawl-malaysian-website/resolve/main/wartaoriental.jsonl Wattpad ------- how-to ~~~~~~ 1. https://f000.backblazeb2.com/file/malay-dataset/crawler/wattpad/wattpad.zip wiser.my -------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/wiser.my.jsonl Youbaby ------- Crawl from https://youbaby.my/blog/ https://github.com/users/huseinzol05/projects/1/views/1?pane=issue&itemId=33632219 Download ~~~~~~~~ 1. https://huggingface.co/datasets/amzar1303/youbaby/resolve/main/youbabymy-data.json zenthegeek.tech --------------- download ~~~~~~~~ 1. https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/zenthegeek.tech.json zulkiflihasan.wordpress.com --------------------------- download ~~~~~~~~ 1.
https://huggingface.co/datasets/mesolitica/crawl-my-website/resolve/main/zulkiflihasan.wordpress.com.jsonl
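Most of the download links above point to ``.jsonl`` files, and some use ``/blob/`` or ``/raw/`` paths instead of ``/resolve/``. The following is a minimal sketch, not part of the original crawlers, of how such a file might be read once it has been downloaded. The helper names ``to_resolve_url`` and ``iter_jsonl`` are hypothetical, and the URL rewrite rests on the assumption that a ``/resolve/`` link serves the file content directly while ``/blob/`` serves a web page and ``/raw/`` may serve only a pointer for large files.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def to_resolve_url(url: str) -> str:
    """Rewrite a /blob/ or /raw/ file link to its /resolve/ form.

    Assumption: on Hugging Face, /resolve/ is the direct-download path.
    Links that already use /resolve/ (or any other form) are returned unchanged.
    """
    for marker in ("/blob/", "/raw/"):
        if marker in url:
            # Replace only the first occurrence so file names are left alone.
            return url.replace(marker, "/resolve/", 1)
    return url


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed record per non-empty line of a local JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: a malformed line is skipped rather than aborting the run.
                print(f"skipping malformed line {line_number} in {path}")
                continue
            yield record
```

After downloading a file with any HTTP client, ``for record in iter_jsonl("isaham.my.jsonl"): ...`` walks its records one at a time without loading the whole file into memory. The field names inside each record differ between datasets, so inspect the first record before relying on any key.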