Welcome to Malaysian-Dataset’s documentation!
Malaysian-Dataset: we gather Malaysian datasets!
Most of the folders lack a README, so it is better to browse the datasets at https://huggingface.co/mesolitica
Documentation#
Proper documentation is available at https://malaysian-dataset.readthedocs.io
How do we gather the datasets?#
Crawling#
Contributors heavily crawled Malaysian websites. The full list of crawled websites is at https://github.com/users/huseinzol05/projects/1
Translation#
We use Google Translate.
We use ChatGPT.
We use the Malaya translation model, https://huggingface.co/mesolitica/translation-t5-small-standard-bahasa-cased-v2
Semisupervised#
Teacher-student#
Label a small set of samples under supervision, then train a base model on them.
Use the trained base model to predict labels for a larger set of samples, then train the next student model on the high-confidence labelled data.
Repeat.
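The teacher-student loop above can be sketched as follows. This is only a minimal illustration, not the project's actual training code: the toy 1-D nearest-centroid classifier, the names `train_centroid_model` and `self_train`, and the 0.9 confidence threshold are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, int]                   # (feature, label)
Model = Callable[[float], Tuple[int, float]]  # feature -> (label, confidence)


def train_centroid_model(labeled: Sequence[Example]) -> Model:
    """Toy stand-in for a real model: a 1-D nearest-centroid classifier."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}

    def predict(x: float) -> Tuple[int, float]:
        # Confidence is the normalised inverse distance to each centroid.
        weights = {y: 1.0 / (abs(x - c) + 1e-9) for y, c in centroids.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())

    return predict


def self_train(
    labeled: Sequence[Example],
    unlabeled: Sequence[float],
    threshold: float = 0.9,   # assumed cut-off for "high confidence"
    max_rounds: int = 5,
) -> Tuple[Model, List[Example], List[float]]:
    """Teacher-student loop: pseudo-label, keep confident samples, retrain."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train_centroid_model(labeled)       # base (teacher) model
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            label, conf = model(x)
            if conf >= threshold:
                confident.append((x, label))    # accept the pseudo-label
            else:
                remaining.append(x)             # leave for a later round
        if not confident:
            break                               # nothing new to learn from
        labeled.extend(confident)
        pool = remaining
        model = train_centroid_model(labeled)   # next student model
    return model, labeled, pool


if __name__ == "__main__":
    seed = [(0.0, 0), (10.0, 1)]
    model, labeled, pool = self_train(seed, [0.5, 9.5, 2.0, 8.0, 5.0])
    print(len(labeled), "labelled samples;", len(pool), "left unlabelled")
```

Samples that never reach the threshold stay in the unlabelled pool rather than being forced into a class, which is what keeps noisy pseudo-labels out of the next student's training data.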
LLM#
We generate data using ChatGPT 3.5, ChatGPT 4 and Mixtral.
Notes#
If mp.py is missing, get it at https://gist.github.com/huseinzol05/98974ae8c6c7a65d4bc0af9f5003786a
If any other Python scripts are missing, please contact me ASAP or create an issue.
Please at least email us before redistributing this data. Remember that we are giving away all this hard work for free.
What you see is just the data; nobody sees how much it cost us to make it public.
Suggestion#
Feel free to contact me to request a new dataset.
Feel free to open an issue if a dataset link returns a forbidden error; sometimes I forget to make it public.
Non-commercial Usage#
A lot of the data here was semi-supervised, translated, tagged or decoded using third-party software such as Google Translate and Google Speech. To avoid future complications, it is better not to use this data for commercial purposes; use for certain research purposes is allowed.
Acknowledgement#
Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring the AWS, Google and private cloud resources used to deploy distributed crawlers.
Contents:#
- chatbot
  - LLaVA-Pretrain
  - NSText2SQL
  - OIG
  - Cleaned Alpaca
  - Blended Skill Talk
  - Camel AI
  - Chat Alpaca
  - ChatGPT4 Code Instruct
  - Code Context
  - Code Instruct Multiturn
  - Code Instructions 122k
  - Code Instructions
  - commitpackft
  - competition_math
  - ConvAI2
  - DialoGPT
  - Dolly15k
  - Evol instruction Function Call
  - Evolution instructions
  - glaive-code-assistant-v2
  - glaive_coder_raw_text
  - glaive-function-calling
  - GPT4ALL-v1.3
  - GPT4ALL
  - Lamini
  - LLaVA-Instruct-150K
  - Ultrachat like using Malaysian context
  - Malaysian Youtube Audio Instructions
  - Minimath
  - Minimath
  - MetaMathQA
  - Minimath
  - Mixtral Magicoder: Source Code Is All You Need on various programming languages
  - Mixtral Malaysian Chat
  - Mixtral Malaysian RAG
  - oasst1
  - OpenOrca
  - sql-create-context
  - router-switch-instruct
  - ShareGPT
  - Python evol instruct 51k
  - Taskmaster
  - UltraChat
  - UltraChat 200K
  - UltraFeedback
  - Unnatural Code
  - Wizard of Wikipedia
  - Unnatural Code
- corpus
- crawl
  - 1Media.My
  - 9shares.my
  - Academia.edu
  - https://agbrief.com/news/malaysia
  - https://www.agendadaily.com/
  - https://www.akademisains.gov.my/asmsj/published-articles/
  - https://akuislam.com/
  - https://alhijrahnews.com/
  - amanz.my
  - Scrap Angkasfera (798 kB)
  - https://www.apu.edu.my//
  - article.poliklinikazzaara.com.my
  - asklegal.my
  - AstroAwani
  - autobuzz.my
  - azhafizah.com
  - b.cari.com.my
  - beautifulnara.com
  - https://berita.rtm.gov.my/
  - https://bernama.com/tam/
  - Bernama
  - bjak.my
  - blog.fincrew.my
  - blog.limkitsiang.com
  - blog.malaysia-asia.my
  - blog.pandai.com
  - blog.yeahhost.com.my
  - blogmalaysia.com
  - blogtipskerjaya.net
  - Buku teks
  - buletinmutiara.com
  - bullishbursa.blogspot.com
  - bumigemilang.com
  - bumiinvest20.home.blog
  - buro247.my
  - c.cari.com.my
  - Carigold
  - carlist.my
  - carsifu.my
  - carsome.my
  - cn.cari.com.my
  - columbiaasia.com
  - Crossref
  - data.gov.my
  - denaihati.com
  - https://www.dermatology.org.my/malaysia_journal.php
  - dewanbahasa.jendeladbp.my
  - discoverkl.com
  - diva.my
  - doctoroncall.com.my
  - dotproperty.com.my
  - dsf.my
  - e-khutbah
  - https://www.e-mjm.org/past_issues.html
  - ecentral.my
  - edu.my PDF
  - ekonomirakyat.com
  - enanyang.my
  - eniraimathi.blogspot.com
  - Eprints Malaysia Universities
  - fintechnews.my
  - https://fliphtml5.com/
  - FMT
  - Foodpanda
  - fuh.my
  - gamerbraves.com
  - gamersantai.com
  - gamersonduty.com
  - gempak.com
  - goody25.com
  - goodymy.com
  - google PDF
  - gov.my PDF
  - Malaysia Hansard
  - hardwarezone.com.sg
  - hargaemas.my
  - hellodoktor.com
  - https://www.heraldmalaysia.com/
  - hijabista.com.my
  - hostingmalaya.com
  - hype.my
  - i-fiqh
  - ideasaham.my
  - IIUM Confession
  - https://ikram.org.my/
  - ilifepost.com
  - imetech.com.my
  - https://www.impiana.my/
  - infopelajar.my
  - intraday.my
  - Ipendidikan
  - Iproperty
  - isaham.my
  - https://www.islam.gov.my/ms/e-penerbitan
  - https://ismaweb.net/
  - isterisihat.com.my
  - jbtalks.cc
  - jomgaming.my
  - https://lamanweb.dbp.gov.my/jurnal/
  - https://kakimuvee.net/
  - kakuchopurei.com
  - kamusbm.com
  - karangan.net
  - kaskus.co.id
  - kebuna.com
  - kebunbandar.com
  - keluarga.my
  - kimchidaily.my
  - kisahdunia.com
  - klgadgetguy.com
  - Klook
  - kopiandproperty.com
  - Kosmo
  - http://latihan-bm.blogspot.com/
  - TLDR
  - lipstiq.com
  - litefinance.org
  - lobakmerah.com
  - lom.agc.gov.my
  - Lowyat
  - Lyrics.my
  - madreshoy.com
  - mahersaham.com
  - majalah.com
  - majalahpama.my
  - https://www.majcafe.com/
  - majoriti.com.my
  - makanbola.com
  - makkalosai.com.my
  - maksudperibahasa.com
  - maktabahalbakri.com
  - malaykord.com
  - Malaymail
  - malaysia-today.net
  - malaysia.tamilheritage.org
  - malaysiaindru.my
  - malaysianow.com
  - malaysiastock.biz
  - malaysiatamilkalvi.com
  - maskulin.com.my
  - Keterangan
  - maukerja.my
  - mcp.anu.edu.au
  - mediahiburan.my
  - https://medmalay.com/
  - mingguanwanita.com.my
  - https://www.mjpath.org.my/past-issue.php
  - https://mjpharm.org/
  - https://www.morthoj.org/
  - Keterangan
  - Progres
  - Status
  - https://www.motomalaysia.com/
  - https://www.mps.org.my/index.cfm
  - https://www.msss.com.my/mjss/
  - mstar.com.my
  - muftiwp.gov.my
  - murai.my
  - Website snapshot
  - my.theasianparent.com
  - myartis.com
  - mycarforum.com
  - mygameon.my
  - https://myjgeosc.com/
  - https://myjms.mohe.gov.my/
  - https://myjsustainagri.com/
  - mykmu.net
  - mymp.my
  - myresipi.com
  - mysoalan.com
  - nambikkai.com.my
  - nanban.com.my
  - nasilemaktech.com
  - https://www.newera.edu.my/publication.php?id=4805&pub=mjcs
  - https://news.seehua.com/
  - https://nextrift.com/
  - nona.my
  - nurulzayani.com
  - https://nutriweb.org.my/mjn/online-first.php
  - ohbulan.com
  - mediahiburan.my
  - ohmyhome.com
  - ohsem.me
  - OpenDOSM
  - org.my PDF
  - orientaldaily.com.my
  - parlimen.gov.my
  - paultan.org
  - pdfdrive
  - penuntutilmu.com
  - perak.org
  - https://www.pgm-my.org/malaysianjournalofgenetics/
  - piston.my
  - pokde.net
  - productnation.co
  - propcafe.net
  - Scraping PropertyGuru-EN (5.58 MB)
  - pt3online.com
  - quola.my
  - raiz.com.my
  - realestatemy.com
  - relevan.com.my
  - https://resepichenom.com/
  - ricebowl.my
  - ringgitohringgit.com
  - ringgitplus.com
  - rojaklah.com
  - rootofscience.com
  - ruby.my
  - sabahpost.net
  - sabrinatajudin.com
  - salary.sg
  - says.com
  - https://selangorkini.my/ta/
  - https://senaraiperibahasa.com/
  - shahbudindotcom.net
  - siakapkeli.my
  - simplywall.st
  - sinar.syok.my
  - Sinar Harian
  - sinarproject
  - sinchew.com.my
  - siraplimau.com
  - skycrapercity.com
  - soalanspm.com
  - stories.my
  - https://story.motherhood.com.my/my/
  - straitstimes
  - studentportal.my
  - suamisihat.com.my
  - https://www.suararisda.my/blog
  - sukanz.com
  - sunahsukasakura.com
  - https://www.surah.my/
  - https://tamil.goodreturns.in/topic/malaysia
  - tamilmurasu.com.sg
  - tantannews.com
  - tcer.my
  - tech-critter.com
  - https://www.techinasia.com/tag/malaysia
  - techlagi.my
  - technave.com
  - tekkaus.com
  - teratotech.com
  - theborneopost.com
  - https://thediagnosa.com/jenis-penyakit/
  - TLDR
  - Note
  - Methodology
  - Progress
  - thekapital.my
  - The Malaysian Insights
  - therakyatpost.com
  - therooftalks.com
  - Ticket2U
  - tryandreview.com
  - tvpertiwi.com.my
  - umminani.com
  - umpan.com.my
  - https://upsronline.com/
  - vanakkammalaysia.com.my
  - varnam.my
  - viralcham.com
  - vocket.com
  - vpsmalaysia.com.my
  - https://wapcar.my/
  - wapcar.my
  - https://wartaoriental.com/
  - Watpadd
  - wiser.my
  - Youbaby
  - zenthegeek.tech
  - zulkiflihasan.wordpress.com
- dictionary
- document-ranking
- dumping
- embedding
- generative
- keyphrase
- knowledge-graph
- lexicon
- llm-benchmark
- llm-instruction
- news
- nlq
- normalization
- ocr
- paraphrase
- parsing
- phoneme
- question-answer
- segmentation
- sentiment
- speech
- speech-to-text
- speech-to-text-semisupervised
- spelling-correction
- summarization
- tagging
- tatabahasa
- text-similarity
- text-to-speech
- tokenization
- translation
  - ChatGPT 3.5 Noisy Translation b.cari.com.my
  - ChatGPT 3.5 Noisy Translation c.cari.com.my
  - ChatGPT 3.5 Noisy Translation Facebook
  - ChatGPT 3.5 Noisy Translation IIUM Confession
  - ChatGPT 3.5 Noisy Translation Manglish
  - ChatGPT 3.5 NLLB Banjarese
  - ChatGPT 3.5 Noisy Translation Twitter
  - ChatGPT 4 Noisy Translation Twitter to local dialect
  - Alignment
  - FLORES-200 Evaluation set
  - Google Translate EN to MS
  - Google Translate filtered Common Crawl
  - Google Translate Malaysia Parliament
  - Google Translate Malay News
  - Google Translate Malay News
  - Google Translate Malaysian PDF
  - Google Translate MS-ID
  - Google Translate MS-JW
  - Google Translate MS-PA
  - Google Translate MS-TA
  - IIUM-Confession
  - LASER for eng_Latn-zsm_Latn
  - Local EN to MS Subtitles
  - Google Translate EN to MS for longer texts
  - Alignment
  - NLLB EN-MS
  - OPUS
- true-case
Social media#
We capture most of the live data from Twitter, Facebook and Instagram using crawlers, so we just search it using Elasticsearch queries.
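As a rough illustration of what such a search could look like, the sketch below builds an Elasticsearch Query DSL body for a keyword search. The index layout is not documented here, so the field names `text`, `platform` and `created_at`, and the helper name `build_search_body`, are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def build_search_body(
    keywords: List[str],
    platform: Optional[str] = None,
    since: Optional[str] = None,
    size: int = 100,
) -> Dict[str, Any]:
    """Build an Elasticsearch Query DSL body for a keyword search.

    The field names `text`, `platform` and `created_at` are assumptions
    about the index mapping, not the project's real schema.
    """
    bool_query: Dict[str, Any] = {
        # Full-text match: a post matches if it contains any of the keywords.
        "must": [
            {"match": {"text": {"query": " ".join(keywords), "operator": "or"}}}
        ],
        # Filters narrow the results without affecting relevance scores.
        "filter": [],
    }
    if platform is not None:
        bool_query["filter"].append({"term": {"platform": platform}})
    if since is not None:
        bool_query["filter"].append({"range": {"created_at": {"gte": since}}})
    return {"size": size, "query": {"bool": bool_query}}


if __name__ == "__main__":
    body = build_search_body(["banjir", "kelantan"], platform="twitter", since="2023-01-01")
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a `_search` request against whichever index holds the crawled posts.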