Language stats with ISO codes, speaker counts, country lists, and country populations
When I work on multilingual projects, estimating and prioritizing the localization work always takes a lot of time. Even with machine translation, you can’t just translate into every language in the world: it takes a long time, and an LLM will burn through your budget on garbage generations, such as the characters “aa” repeated thousands of times in a row. I built the npm package langstats to solve this problem.
Now you can just run npm i langstats
and fetch the most widely used languages, optionally filter them by the target countries where your app is present, then keep the top N to get a list of languages for translation.
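The workflow above can be sketched in TypeScript roughly as follows. This is only an illustration under assumptions: the `LanguageEntry` shape, the `pickLanguages` helper, and the sample records are hypothetical and are not the actual langstats API. The speaker counts are copied from the demo output in this post, and the country lists are shortened. Check the package documentation for the real exports.

```typescript
// Hypothetical record shape for illustration; NOT the real langstats API.
interface LanguageEntry {
  name: string;
  codes: string[];       // ISO codes, e.g. ["en", "eng"]
  speakers: number;      // total number of speakers
  countries?: string[];  // may be missing for some languages
}

// Sample records; speaker counts come from the demo output in this post.
const sample: LanguageEntry[] = [
  { name: "English", codes: ["en", "eng"], speakers: 1132366680,
    countries: ["India", "United States", "United Kingdom"] },
  { name: "Spanish", codes: ["es", "spa"], speakers: 485000000,
    countries: ["Mexico", "Colombia", "Spain"] },
  { name: "Portuguese", codes: ["pt", "por"], speakers: 254300000,
    countries: ["Brazil", "Angola", "Portugal"] },
  { name: "German", codes: ["de", "deu"], speakers: 105000000,
    countries: ["Germany", "Belgium", "Austria", "Switzerland"] },
  { name: "Punjabi", codes: ["pa", "pan"], speakers: 125000000 }, // no country data
];

// Keep languages used in at least one target country (if any are given),
// order them by speaker count, and return the top N.
function pickLanguages(
  all: LanguageEntry[],
  topN: number,
  targetCountries?: string[],
): LanguageEntry[] {
  const wanted = targetCountries ? new Set(targetCountries) : null;
  return all
    // filter() returns a new array, so the sort below does not mutate the input
    .filter((lang) =>
      wanted === null || (lang.countries ?? []).some((c) => wanted.has(c)))
    .sort((a, b) => b.speakers - a.speakers)
    .slice(0, topN);
}

const picked = pickLanguages(sample, 2, ["Spain", "Portugal", "Germany"]);
console.log(picked.map((lang) => lang.codes[0])); // [ 'es', 'pt' ]
```

Note one design choice in this sketch: when a country filter is set, languages without country data are dropped. With incomplete data you may prefer to keep them and review them by hand.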
A simple demo that prints stats for the 20 most spoken languages:
Top languages:
#1 Chinese (zh,zho)
- Total speakers: 1299877520
#2 English (en,eng)
- Total speakers: 1132366680
- Top 10 countries where used: India, United States, Pakistan, Nigeria, Philippines, United Kingdom, South Africa, Tanzania, Kenya, Uganda
#3 Chinese (cmn)
- Total speakers: 897071810
#4 Spanish (es,spa)
- Total speakers: 485000000
- Top 10 countries where used: Mexico, Colombia, Spain, Argentina, Venezuela, Chile, Guatemala, Ecuador, Bolivia, Honduras
#5 Arabic (ar,ara)
- Total speakers: 422000000
- Top 10 countries where used: Egypt, Algeria, Afghanistan, Sudan, Iraq, Morocco, Morocco, Saudi Arabia, Yemen, Syria
#6 Bangla (bn,ben)
- Total speakers: 300000000
- Top 10 countries where used: Bangladesh
#7 Portuguese (pt,por)
- Total speakers: 254300000
- Top 10 countries where used: Brazil, Angola, Portugal, Timor-Leste, Cape Verde, São Tomé & Príncipe
#8 French (fr,fra)
- Total speakers: 208157220
- Top 10 countries where used: Congo - Kinshasa, France, Canada, Côte d’Ivoire, Cameroon, Chad, Senegal, Rwanda, Benin, Burundi
#9 Indonesian (id,ind)
- Total speakers: 198996550
- Top 10 countries where used: Indonesia
#10 Russian (ru,rus)
- Total speakers: 171428900
- Top 10 countries where used: Russia, Kazakhstan, Tajikistan, Belarus
#11 Japanese (ja,jpn)
- Total speakers: 128000000
- Top 10 countries where used: Japan, Palau
#12 Punjabi (pa,pan)
- Total speakers: 125000000
#13 German (de,deu)
- Total speakers: 105000000
- Top 10 countries where used: Germany, Belgium, Austria, Switzerland, Luxembourg, Liechtenstein
#14 Egyptian Arabic (arz)
- Total speakers: 100542400
#15 Javanese (jv,jav)
- Total speakers: 84308740
#16 Marathi (mr,mar)
- Total speakers: 83100000
#17 Swahili (swh)
- Total speakers: 82300000
#18 Turkish (tr,tur)
- Total speakers: 82231620
- Top 10 countries where used: Türkiye, Cyprus
#19 Telugu (te,tel)
- Total speakers: 82000000
#20 Wu Chinese (wuu)
- Total speakers: 81400000
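A listing in this format could be rendered with a small helper like the one below. It is a sketch under assumptions: the `LangRecord` shape and the `formatStats` function are hypothetical and not part of langstats, and the two records reuse values from the output above. The countries line is skipped when no country data is available, which matches entries such as Punjabi in the listing.

```typescript
// Hypothetical record shape for illustration; NOT the real langstats API.
interface LangRecord {
  name: string;
  codes: string[];
  speakers: number;
  countries?: string[]; // absent when no country data is available
}

// Two records with values taken from the listing above, already sorted by speakers.
const demoRecords: LangRecord[] = [
  { name: "Japanese", codes: ["ja", "jpn"], speakers: 128000000, countries: ["Japan", "Palau"] },
  { name: "Punjabi", codes: ["pa", "pan"], speakers: 125000000 },
];

// Render records (assumed to be pre-sorted) in the same layout as the demo output.
function formatStats(records: LangRecord[], maxCountries = 10): string {
  const lines: string[] = ["Top languages:"];
  records.forEach((rec, i) => {
    lines.push(`#${i + 1} ${rec.name} (${rec.codes.join(",")})`);
    lines.push(`- Total speakers: ${rec.speakers}`);
    // Omit the countries line entirely when there is nothing to show.
    if (rec.countries && rec.countries.length > 0) {
      const shown = rec.countries.slice(0, maxCountries).join(", ");
      lines.push(`- Top ${maxCountries} countries where used: ${shown}`);
    }
  });
  return lines.join("\n");
}

console.log(formatStats(demoRecords));
```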
The dataset covers more than 1,600 languages and more than 180 countries.
As you can see in the example above, some data may be missing; for example, no countries are listed for Chinese. That’s because reliable data about languages is hard to find. The data will be updated and enriched in future releases.
Of course, I could add some data manually, but the purpose of this package is to provide real data, not numbers based on how I feel. This approach distinguishes langstats
from many other datasets that contain randomly generated data.
Since this is a dataset I maintain for my own projects, you can start using it too and skip the headache of data mining and verification.
If you know where to find data about language usage, create an issue in the GitHub repo and share it there to help improve the precision of the stats.