Uiser:Illandancient/Word frequencies

Word frequencies

At about eleven thirty on Tuesday 25th August 2020, Reddit user u/Ultach posted in the r/Scotland forum about their discovery that a large amount of the Scots language wikipedia had been created and edited by a teenager from North Carolina who didn't speak the Scots language. The Scots wikipedia at the time had about 57,950 articles, of which around 20,000 were created by the North Carolinian, who had amassed over 200,000 edits over an eight-year period.

There was a fuss on Reddit, a fuss in some media, and a group of Scots speakers and wikipedians came together to fix the problem. Whilst I only have a passing knowledge of the Scots language, I was at the time reading Northern and Insular Scots by Robert McColl Millar, about the dialects of Scots spoken in Caithness, Orkney and Shetland. It was interesting stuff. Inspired by the book, I thought that perhaps various data-munging techniques could be used to study both the North Carolinian Scots dialect used on wikipedia and the fixed Scots dialect used on wikipedia, which might eventually amount to a standardised spelling system, and that the Scots Wikipedia could be presented as a corpus of the Scots language.

Acting quickly, I found a script on GitHub that could scrape wikipedia and get the word frequencies. Realising I'd never get it up and running myself, I contacted the creator, Ilya Semenov, to commission a scrape of the Scots wiki. He very kindly sent through the Scots word frequency list for the 20200801 wikidump in just a few minutes. This proved invaluable in the first few days of the Scots wiki clean-up exercise.

It was used to identify the most commonly used words of the North Carolina dialect, so that automated tools could be used to fix spellings.

Python scripts

I had a go at running the word frequency script myself, but ran into many problems.

  1. I hadn't used Python in a number of years (8) and my laptop wasn't happy about trying to resurrect old programming environments
  2. The script didn't want to work on a Windows computer, so I had to resort to digging out an old Raspberry Pi linux computer that hadn't been used for a number of years (4)
  3. Flashing the Raspberry Pi with the latest OS led to it running crazy slowly: webpages took minutes to load, and file movements were jerky and slow.
  4. The word-freq script required a wikipedia extractor script whose latest GitHub commit was broken
  5. Using the latest version of Python (3.4) seemed to break everything, whereas an older version of Python (2.7), although no longer supported, seemed to run everything
  6. The missus bought me a new fast SD card from Amazon which arrived in less than 24 hours, but this too proved to be crazy slow
  7. I eventually got the scripts to work in the early morning of 01-09-2020. It takes 4449 seconds to process the current scotswiki of 57,000 articles
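
For anyone who wants to see the general idea without fighting the tooling, here is a minimal sketch of a word frequency count in Python. It is not the GitHub script mentioned above: the tokenising rule and the function name `word_frequencies` are my own assumptions, and it expects article text that has already been extracted from the dump. The two thresholds (two letters or more, three occurrences or more) are the ones quoted in the stats on this page.

```python
import re
from collections import Counter

# Assumed tokenising rule: runs of letters, optionally joined by apostrophes,
# so that "keeng's" stays a single word. The real script may tokenise differently.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def word_frequencies(texts, min_length=2, min_count=3):
    """Count lower-cased words across an iterable of plain-text articles.

    Words shorter than min_length letters, or seen fewer than min_count
    times in total, are left out of the result.
    """
    counts = Counter()
    for text in texts:
        for word in WORD_RE.findall(text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return {word: n for word, n in counts.items() if n >= min_count}
```

Calling `word_frequencies(articles)` on a list of article strings returns a plain dictionary of word to count, which is the shape assumed by any later comparison.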

Comparing word lists

Armed with two different Scots word frequency lists, one from the start of Aug 2020 and one from the start of May, it would be possible to compare them to see which new words had been introduced and if the counts of words had increased or decreased.
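
A rough sketch of that comparison in Python follows. The function name `count_changes` and the dictionary-of-counts input are assumptions of mine, not part of any existing script. A word missing from one list is treated as having zero occurrences there, which slightly overstates the change for words that merely fell below the list's cut-off.

```python
def count_changes(old_counts, new_counts):
    """Return (word, change) pairs for every word whose count differs.

    The change is new count minus old count, so a negative number means the
    word has become rarer. Pairs are sorted with the biggest drops first.
    """
    words = set(old_counts) | set(new_counts)
    changes = {
        word: new_counts.get(word, 0) - old_counts.get(word, 0)
        for word in words
    }
    return sorted(
        ((word, delta) for word, delta in changes.items() if delta != 0),
        key=lambda item: (item[1], item[0]),
    )
```

Words that only appear in the newer list come out with a positive change, so the same output answers both questions: which words are new, and which counts have gone up or down.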

Comparing the English word list and the Scots word list might provide a list of words unique to English, unique to Scots and common to both languages. It should be noted that many Scots words just happen to be spelled exactly the same way as many English words, although the definitions, usage and grammar vary.

Old vs New Scots Wiki comparison

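Comparing the 01-08-2020 Scots word list with the 01-09-2020 Scots word list should draw out the North Carolina Scots dialect, which, whilst a linguistic dead end, might be useful. The words below are listed with the change in their number of occurrences between the two lists.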

  • keeng (-1221)
  • televeesion (-850)
  • daughter (-621)
  • than (-593)
  • years (-499)
  • miles (-467)
  • lairge (-455)
  • built (-327)
  • creautit (-318)
  • months (-295)
  • each (-212)
  • well (-198)
  • operating (-195)
  • brought (-175)
  • given (-169)
  • perhaps (-166)
  • system (-144)
  • himself (-142)
  • systems (-133)
  • haeve (-112)
  • lairgest (-105)
  • was (-92)
  • with (-92)
  • keengs (-79)
  • herself (-79)
  • such (-77)
  • they (-68)
  • their (-68)
  • more (-64)
  • height (-63)
  • large (-63)
  • father (-61)
  • mother (-61)
  • together (-61)
  • break (-59)
  • brother (-59)
  • thought (-58)
  • forward (-57)
  • kernel (-56)
  • family (-54)
  • computer (-53)
  • tournament (-53)
  • were (-52)
  • various (-49)
  • program (-49)
  • memory (-48)
  • windows (-48)
  • hardware (-42)
  • total (-41)
  • used (-41)
  • file (-41)
  • use (-40)
  • os (-40)
  • also (-39)
  • insects (-39)
  • programs (-37)
  • user (-37)
  • through (-36)
  • word (-35)
  • granddaughter (-34)
  • unix (-34)
  • eight (-34)
  • computers (-33)
  • these (-32)
  • one (-32)
  • linux (-32)
  • has (-31)
  • tha (-30)
  • software (-30)
  • developed (-29)
  • mode (-29)
  • old (-28)
  • other (-27)
  • device (-27)
  • access (-26)
  • example (-25)
  • resources (-25)
  • have (-24)
  • can (-23)
  • only (-23)
  • en (-23)
  • many (-23)
  • there (-22)
  • interface (-22)
  • toun (-21)
  • after (-21)
  • breaks (-21)
  • server (-21)
  • aa (-20)
  • all (-20)
  • small (-19)
  • keeng's (-19)
  • insect (-19)
  • daughters (-19)
  • freebsd (-18)
  • lairger (-18)
  • number (-18)
  • over (-18)
  • any (-18)
  • code (-17)

It looks like the automatic replacing of words such as keeng has been successful, but a number of articles on computer operating systems seem to have changed too.

Comparing the word frequency lists from several languages could help to understand trends in wiki word usage. For example, the equivalent words for 'city', 'province' and 'municipality' are more common on wiki pages than general usage would lead us to expect.

Uniquely English words

As a starting point, wikipedia user James Salsman has a list of English words not usually seen in Scots:

afraid after angry ball before behind between blow carefully cattle child cloth clothes creature cry dig dirty do doubt down dusk dusty ewe fancy four friend from girl going have head hold house hundred knock kye live make mud my now old one our out over pet potatoes shake sing small smell spin stay strange strike stubborn stupid take to today tomorrow town two ugly upside very water what which who woman you

This could serve as the kernel of a whitelist for identifying uniquely English words, if we compare their frequency or rank within the Scots and English word frequency lists.

Looking at merely the top 20,000 words in each language, there are 11,515 words common to both languages, and therefore 8,485 words unique to each. This is somewhat flawed, as the English wikipedia contains a total of over a million unique words, whilst the Scots wikipedia contains around 55,000 unique words. A google sheet of the word list comparison can be found here.
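
The comparison itself was done in a spreadsheet, but the same split can be sketched in a few lines of Python. The function names `top_words` and `compare_top` are hypothetical, and each input is assumed to be a dictionary of word to count. When both lists hold at least n words, the two "unique" sets are always the same size, n minus the number of common words, which is where 20,000 - 11,515 = 8,485 comes from.

```python
def top_words(counts, n):
    """Return the set of the n most frequent words (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ranked[:n]}


def compare_top(counts_a, counts_b, n=20000):
    """Split the top-n words of two languages into common and unique sets."""
    top_a = top_words(counts_a, n)
    top_b = top_words(counts_b, n)
    return {
        "common": top_a & top_b,   # in the top n of both lists
        "only_a": top_a - top_b,   # in the top n of the first list only
        "only_b": top_b - top_a,   # in the top n of the second list only
    }
```

Note that "unique" here only means "not in the other language's top n", so a word can count as unique to Scots even if it appears further down the English list.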

The top twenty most frequently used words that are common to both are as follows, with Scots wiki occurrences.

  • the (417,894)
  • an (182,133)
  • in (159,508)
  • is (112,395)
  • as (46,494)
  • it (34,234)
  • on (30,780)
  • for (30,530)
  • that (23,256)
  • at (18,388)
  • or (18,090)
  • are (17,643)
  • he (17,047)
  • his (15,935)
  • its (15,848)
  • which (12,692)
  • of (12,633)
  • population (12,613)
  • municipality (11,635)
  • de (10,525)

Most of these are to be expected, except "de", which is barely an English word. There's something going on here.

The top twenty most frequently used words that are unique to the Scots wikipedia are as follows, with Scots wiki occurrences.

  • tae (74,257)
  • wis (59,140)
  • bi (32,329)
  • wi (30,901)
  • frae (26,145)
  • ceety (17,840)
  • haes (16,541)
  • ane (12,166)
  • aw (10,999)
  • toun (10,742)
  • aurie (10,281)
  • destrict (9,301)
  • locatit (8,893)
  • haed (8,846)
  • maist (8,258)
  • efter (8,123)
  • pairt (7,353)
  • twa (7,133)
  • ither (6,756)
  • hae (6,505)

I'm skeptical whether "ceety" or "aurie" should be anywhere near a most frequently used Scots word list.

The top twenty most frequently used words that are unique to the English wikipedia are as follows.

  • given
  • himself
  • brought
  • defeated
  • opening
  • competed
  • township
  • households
  • moving
  • featuring
  • accepted
  • providing
  • household
  • surrounding
  • painting
  • losing
  • resulting
  • suggested
  • allowing
  • founding

These are all words that have been eliminated mechanically using bots, so no surprises. If we instead ignore words ending with 'ing', the next twenty words are:-

  • unknown
  • slightly
  • height
  • band's
  • herself
  • ncaa
  • norway
  • roof
  • follow
  • perhaps
  • users
  • collected
  • paintings
  • grounds
  • musicians
  • musician
  • moth
  • owners
  • thousands
  • wounded

"ncaa" might come as a surprise. It has 104,048 occurrences, and a bit of wiki-fu reveals that it is the National Collegiate Athletic Association. Perhaps there are 104,048 individual athletes with their own wikipedia pages. The Scots should count their blessings that they have avoided this, although there does seem to be an abundance of pages about European royalty and death metal bands.

Comparisons with Gaelic

Once I was aware that there was actually a Gaelic language wikipedia, albeit with only 14,000 pages, I processed it for a word frequency list, which was then compared with Scots. The results can be found on a google spreadsheet here.

It seems that, due to the small size of the Gaelic corpus, the word list 'bottoms out'. I only consider the top 20,000 words in each language, which ought to filter out words that are only in a handful of articles and keep the ones in most common use. There are a lot of words showing up as being common to both languages when, in actual fact, looking at the numbers of occurrences, they are very rare in Gaelic and very common in Scots.

Scots and Gaelic word count comparisons

Word    Scots occurrences    Gaelic occurrences
or      18,090               64
are     17,643               36
he      17,047               46
his     15,935               76
its     15,848               42

Due to the comparative rarity of these words in Gaelic, should they be considered unique to Scots (outwith English)? The Gaelic wikipedia is about a quarter the size of the Scots wikipedia, so we might expect there to be only a factor of four difference between the most common words.

Perhaps the entire comparison of overlapping word spaces needs to be done by hand, aided by an algorithm, rather than completely automatically. Maybe considering the percentile ranks of each word in each language, and their ratio to each other, would eliminate any artifacts from the corpus size. For example, "its" is a 99th percentile word in Scots but a 1st percentile word in Gaelic, so effectively it is unique to Scots. This might take more processing than Excel can manage alone, so I'll have to roll my sleeves up and get back into Perl.
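
As a sketch of how the percentile idea might work, here it is in Python rather than Perl. Everything named here is an assumption for illustration: the functions `percentile_ranks` and `effectively_unique`, the definition of percentile rank (the share of words in the same list with a strictly lower count), and the cut-offs of 90 and 10, which would need tuning against the real lists.

```python
from bisect import bisect_left


def percentile_ranks(counts):
    """Map each word to the percentage of words in the list with a lower count."""
    sorted_counts = sorted(counts.values())
    total = len(sorted_counts)
    return {
        word: 100.0 * bisect_left(sorted_counts, count) / total
        for word, count in counts.items()
    }


def effectively_unique(counts_a, counts_b, high=90.0, low=10.0):
    """List shared words that are common in language A but rare in language B.

    A word qualifies when its percentile rank is at least `high` in A and
    at most `low` in B. Both cut-offs are hypothetical defaults.
    """
    ranks_a = percentile_ranks(counts_a)
    ranks_b = percentile_ranks(counts_b)
    shared = ranks_a.keys() & ranks_b.keys()
    return sorted(
        word for word in shared
        if ranks_a[word] >= high and ranks_b[word] <= low
    )
```

Because the ranks are relative to each list's own size, a word like "its" can be flagged even though the Gaelic corpus is much smaller than the Scots one.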

Article count and stats

  • 2005-06-23 00:00 0 articles
  • 2020-08-01 56,524 unique words, 4,458,169 total words (two letters or more, three occurrences or more)
  • 2020-08-29 23:30 57,956 articles
  • 2020-09-01 09:39 57,984 articles
  • 2020-09-01 56,828 unique words, 4,500,147 total words (two letters or more, three occurrences or more)

Citogenesis

Citogenesis is a satirical theory about where citations come from, as explained in this XKCD comic: made up on wikipedia -> used in an official document -> cited on wikipedia. There appear to be a few examples on the Scots wiki.

  • conseeder
  • proposeetion
  • peteetion

These words are very rare on the Scottish Corpus website, yet are very common on the Scots wikipedia. It is believed that the Online Scots dictionary harvested words and definitions from wikipedia. These words have been used in the Scots translation of the Scottish Parliament document "Your Scottish Pairlament". This document has subsequently been cited in the Scots wikipedia, thus completing the circle.

It is possible that these spellings are used in the north-east Scots dialect: 'conseeder' was used by north-eastern author Sheena Blackhall in 1996, although the other two words were not.