English is no longer the language of the web

Flags of the world

By Ethan Zuckerman from Quartz.com:

Conventional wisdom suggests that English is becoming “the world’s second language,” a lingua franca that many forward-looking organizations are adopting as a working language. Optimists about the spread of English as a global second language suggest it will enable collaboration and ease problem solving without threatening the survival of mother tongues. Pointing to hundreds of thousands of Chinese children who learn English by shouting phrases back at teachers, the American entrepreneur Jay Walker offers the idea that English will be a language of economic opportunity for most speakers: they’ll work and think in their mother tongue, but English will allow them to communicate, share, and transact.

Cultural-preservation organizations like UNESCO aren’t as confident of this vision. They warn that English may crowd out less widely spoken languages as it spreads around the world through television, music, and film. But something more subtle and complicated appears to be going on. While English may be emerging as a bridge language, a wave of media is being produced in other languages, in newspapers, on television, and on the Internet. As technologies make it easier for people to communicate to broad and narrow audiences in their native languages, we’re discovering that linguistic difference is surprisingly persistent.

One way to consider the future of language in a connected world is to ask, “What percent of the Internet’s content is written in English?”

Look online for an answer to that query—posed in English—and you’re likely to encounter a website last updated in 2003, EnglishEnglish.com. The site’s “English Facts and Figures” page asserts that “80% of home pages on the Web are in English, while the next greatest, German, has only 4.5% and Japanese 3.1%.” The sources behind this assertion are unclear, but it’s consistent with early research on linguistic diversity online. In 1997, Geoffrey Nunberg and Hinrich Schütze released a study estimating that 80 percent of the World Wide Web’s content was in English. The Online Computer Library Center followed in 2003 with a study estimating that 72 percent of online content was in English.

These early studies led researchers to suggest that English had a “head start” that other languages would find difficult to overcome. With such a large user base of English speakers online, many websites would publish content only in English, and web users would adapt to monolingualism by improving their language skills, which in turn would increase the incentive to publish in English. Neil Gandal of Tel Aviv University analyzed web use in Quebec, Canada, in 2001 and observed that native French speakers spent 66 percent of their online time on English-language websites. Furthermore, young Quebecois looked at more English content than their elders, suggesting that language barriers would be even less relevant for a future generation of web users. Given that Francophone Quebecois were willing to read English content online, Gandal argued, website developers wouldn’t bother to localize their content, leading to a future with more sites entirely in English.

Both the 70–80 percent English “fact” and the head start theory have lingered despite evidence that the linguistic shape of the World Wide Web has changed dramatically in the past ten years as it expanded both in scale and in the number of authors creating content. One reason the “fact” persists is that it’s incredibly difficult to generate a believable estimate of language diversity online. Early studies tried to create a random sample of websites by choosing a selection of IP addresses, loading whatever page emerged, and using automated tools to determine what language it was written in. This method works poorly these days, when sites like Facebook, reached via a single IP address, include multilingual content generated by more than half a billion users. Newer methods rely on search engines to index the web, then attempt to estimate coverage of different languages on the basis of the comparative frequency of words in different languages.
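To make the sampling idea concrete, here is a minimal sketch in Python of the step where an automated tool decides what language a page is written in, followed by a tally of each language's share of the sample. It is an illustration only, not the method used by any of the studies mentioned: the tiny stop-word lists, the function names `guess_language` and `language_shares`, and the four sample "pages" are all assumptions made up for this example. A real detector would rely on much larger language profiles and far bigger samples.

```python
from collections import Counter

# Tiny illustrative stop-word lists (an assumption for this sketch);
# a real detector would use much larger per-language profiles.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
}

def guess_language(text):
    """Return the language whose stop words appear most often, or None."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def language_shares(pages):
    """Tally detected languages across sampled pages; return each one's share."""
    counts = Counter(guess_language(page) for page in pages)
    counts.pop(None, None)  # ignore pages with no recognizable stop words
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}

# Hypothetical sample standing in for randomly fetched pages.
sample = [
    "the cat is in the garden and the dog is out",
    "le chat est dans le jardin et les chiens",
    "der Hund und die Katze ist nicht da",
    "the language of the web is changing",
]
shares = language_shares(sample)
print(shares)
```

The sketch also hints at why this approach struggles with a site like Facebook: a single fetched page yields a single label, however many languages its users actually write in.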

[Full article]