The next 5 paragraphs are me whining. To get to the really important stuff, start at paragraph 6.

So I have been pouring two weeks into WildOn, a project to find out how many addons exist out there in the wild. But before I start unleashing web crawlers on the web, causing havoc and chaos, it would be helpful to compare what’s out there with what we know. What we know is everything on AMO, so we start there. The point of this extra work is to have a baseline, so that when we release a web crawler on AMO and tell it to find all the extensions, we’ll have something to compare its results to.

Actually, even this was a bit confusing. AMO provides an API to view its addons (well, actually two versions of the API, with the older being slightly more useful). But that approach was eventually scrapped for several reasons. The main one is that there is a lot of information on AMO that isn’t in the extension itself, such as which operating systems are supported and whether the addon is a theme or an extension. (The former has been supported since Firefox 2, but I have rarely seen it used; the latter is completely optional.) This makes any sort of conclusion inconclusive because you don’t have enough information.

Then there was the problem of having too much information in the database, to the point where ~4000 addons took up ~1.8 GB. For an SQLite database, this can get slow. Some queries, such as the number of extensions that support the ‘jp-JP’ locale, get even more intensive, because you build a table comprising tens of thousands of rows (one row for each guid/locale combination). The reason is that older versions were being included in the same table as the newest version of each addon, and some addons had 50+ different versions. The solution seems to be to move old versions to a different table. SQL queries now seem to go much faster.
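To make the table split concrete, here is a minimal sketch using Python's sqlite3 module. The table and column names (addon_locales, old_addon_locales, guid, version, locale, is_latest) are made up for illustration; the real WildOn schema is not shown in this post.

```python
import sqlite3

# Hypothetical schema: one row per guid/version/locale combination,
# with is_latest = 1 only for the newest version of each addon.
SCHEMA = "(guid TEXT, version TEXT, locale TEXT, is_latest INTEGER)"


def split_old_versions(conn):
    """Move rows that belong to superseded versions into their own table."""
    conn.execute("CREATE TABLE IF NOT EXISTS old_addon_locales " + SCHEMA)
    conn.execute(
        "INSERT INTO old_addon_locales "
        "SELECT guid, version, locale, is_latest "
        "FROM addon_locales WHERE is_latest = 0"
    )
    conn.execute("DELETE FROM addon_locales WHERE is_latest = 0")
    conn.commit()


def count_addons_for_locale(conn, locale):
    """Count distinct addons whose current version ships the given locale."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT guid) FROM addon_locales WHERE locale = ?",
        (locale,),
    ).fetchone()
    return row[0]


def demo():
    """Build a tiny in-memory database and apply the split."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE addon_locales " + SCHEMA)
    rows = [
        ("a@example.com", "1.0", "ja-JP", 0),  # old version had ja-JP
        ("a@example.com", "2.0", "en-US", 1),  # newest version dropped it
        ("b@example.com", "0.9", "ja-JP", 0),
        ("b@example.com", "1.0", "ja-JP", 1),
    ]
    conn.executemany("INSERT INTO addon_locales VALUES (?, ?, ?, ?)", rows)
    split_old_versions(conn)
    return conn
```

After the split, a locale query only scans rows for current versions, which is also why a locale can end up with a count of 0 if only older versions shipped it.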

Another issue, and one that makes me loathe RDF, is install.rdf. I strongly disagree with the use of RDF for anything :) It is difficult to parse with a regular XML parser. (There are a few Python RDF libraries out there, but rdflib, the most promising, seems to like not working and not having good examples. Only sheppy can save them now, but he’s working on MDC.) This is especially true of rdf:resource, which I am completely ignoring right now. It also seems that AMO editors like to get creative with install.rdf, which has caused problems for me. For example, I cannot rely on targetPlatform: some extensions actually have their targetPlatform in the Description tag. I know this because one of the extensions had Firefox’s GUID :( There are other quirks too, like having the id as an attribute of Description instead of a child tag. All of these are probably perfectly valid, but they make my life significantly more difficult.
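As an illustration of the "id as an attribute" quirk, here is a rough sketch of reading the addon id from install.rdf with the standard library's plain XML parser. This is not the actual WildOn code; the helper names (em_value, addon_id) and the sample manifests are invented for the example. It handles both the child-element and the attribute spelling, and it ignores rdf:resource entirely.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EM_NS = "http://www.mozilla.org/2004/em-rdf#"


def em_value(description, name):
    """Return em:<name>, whether written as a child element or an attribute."""
    child = description.find("{%s}%s" % (EM_NS, name))
    if child is not None and child.text:
        return child.text.strip()
    return description.get("{%s}%s" % (EM_NS, name))


def addon_id(install_rdf_text):
    """Find the install-manifest Description and return its em:id, if any."""
    root = ET.fromstring(install_rdf_text)
    for desc in root.iter("{%s}Description" % RDF_NS):
        # The about attribute may or may not carry the RDF prefix.
        about = desc.get("{%s}about" % RDF_NS) or desc.get("about")
        if about == "urn:mozilla:install-manifest":
            return em_value(desc, "id")
    return None


# Two made-up manifests that spell the same id in different ways.
ELEMENT_STYLE = """<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:em="http://www.mozilla.org/2004/em-rdf#">
  <Description about="urn:mozilla:install-manifest">
    <em:id>sample@example.com</em:id>
  </Description>
</RDF>"""

ATTRIBUTE_STYLE = """<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:em="http://www.mozilla.org/2004/em-rdf#">
  <Description about="urn:mozilla:install-manifest"
               em:id="sample@example.com" />
</RDF>"""
```

Both addon_id(ELEMENT_STYLE) and addon_id(ATTRIBUTE_STYLE) return the same id, so the rest of the pipeline does not need to care which spelling the author chose.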

Yet another problem (YAP) was that many early extensions did not use chrome.manifest, and some newer ones don’t either. So their locale information has to be looked up in either install.rdf or contents.rdf. This makes me (and by extension, kittens and baby Jesus) sad. I don’t have a fix for this yet.
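For the extensions that do ship a chrome.manifest, pulling out the locales is the easy part. A minimal sketch, assuming the usual "locale <package> <locale-name> <path>" line format; the function name and the sample manifest are made up, and the install.rdf/contents.rdf fallback is not handled here.

```python
def locales_from_manifest(manifest_text):
    """Collect locale names from the 'locale' lines of a chrome.manifest."""
    locales = set()
    for line in manifest_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Expected shape: locale <package> <locale-name> <path> [flags]
        if fields[0] == "locale" and len(fields) >= 4:
            locales.add(fields[2])
    return locales


# Invented manifest for illustration only.
SAMPLE_MANIFEST = """\
content sample jar:chrome/sample.jar!/content/
# translations
locale  sample en-US jar:chrome/sample.jar!/locale/en-US/
locale  sample fr-FR jar:chrome/sample.jar!/locale/fr-FR/
skin    sample classic/1.0 jar:chrome/sample.jar!/skin/
"""
```

Calling locales_from_manifest(SAMPLE_MANIFEST) gives a set containing en-US and fr-FR. Note that it takes whatever the manifest says at face value, so odd names come through unchanged.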

But enough about problems, what about SUCCESS!?

OK. So I managed to get a local copy of every extension that is on AMO. Since parsing, analyzing, and writing to persistent storage takes a long time, I decided to save myself some trouble and just do the first 2500 extensions (out of the ~7K folders that I have).

Of the 2500 ‘extension numbers’, 1630 were successfully analyzed. This is mainly because extension numbers don’t increment perfectly (e.g. there is no addon #1; the first one starts at #4). Only about 100 addons failed to parse, giving me a success rate of 94%. Those extensions had quirks (e.g. bad RDF) that were either invalid or that I couldn’t figure out.

Out of the 1630 extensions, this is what xulrunner-like applications they supported :

And here are the approximate numbers:

Name                                   Count
Prism/Webrunner                        2
Songbird (old)                         2
Instant                                1
Midbrowser                             3
toolkit (any gecko 1.9 application)    7
eMusic DLM                             12
Seamonkey (broken GUID)                2
Nvu                                    11
Sunbird                                16
Thunderbird                            256
Songbird                               13
Seamonkey                              101
Flock                                  159
Netscape Navigator                     68
Mozilla Suite                          166
Firefox                                1466

This looks OK so far. One expects a few non-Firefox extensions. The Thunderbird numbers seem a little low. Keep in mind that this is only ~33% of the total addons.

Locales are a bigger mess. Many early extensions don’t use chrome.manifest, so I decided to skip those at first, but I now realize I have to fix that. Out of 1630 addons, only 464 had chrome.manifest files that I was able to read. But here is the breakdown anyway:

Number of locales: 173 (en, en-US, and en-GB are all considered different locales). Some of them are invalid. For example, Xultris has an invalid locale called xultrisLocale. This can be filtered out with a regular expression, but anyway.
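A sketch of what such a filter might look like. The pattern is an assumption about what a locale name should look like (a two- or three-letter language code, optionally followed by hyphen-separated parts); it only checks the shape, so a well-formed but bogus code like en-EN would still pass.

```python
import re

# Assumed shape: 2-3 letter language, then optional "-part" suffixes
# (region, variant), e.g. "ar", "en-US", "ja-JP-mac".
LOCALE_SHAPE = re.compile(r"[a-z]{2,3}(-[a-z0-9]+)*", re.IGNORECASE)


def looks_like_locale(name):
    """Return True if the whole string has the shape of a locale code."""
    return LOCALE_SHAPE.fullmatch(name) is not None


def drop_bogus_locales(names):
    """Keep only the names that pass the shape check, preserving order."""
    return [name for name in names if looks_like_locale(name)]
```

With this check, xultrisLocale and convertLocale are dropped, while en-US, ar and ja-JP-mac are kept.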

Locale Supported Extensions
en-US 439
sv-SE 57
it-IT 190
de-DE 189
pl-PL 137
es-ES 181
fi-FI 64
ru-RU 129
nl-NL 145
pt-BR 162
fr-FR 204
ja-JP 124
zh-CN 126
zh-TW 114
ko-KR 86
cs-CZ 90
en-GB 29
es-AR 54
mn-MN 4
ro-RO 30
sk-SK 118
ca-AD 56
el-GR 38
pt-PT 49
ar 18
uk-UA 61
sr-YU 12
bg-BG 28
hu-HU 84
hr-HR 64
da-DK 92
nb-NO 32
sl-SI 23
lt-LT 21
tr-TR 72
ar-TN 0
de-AT 10
he-IL 41
el 6
ja-JA 1
mk-MK 10
be-BY 25
sq-AL 8
en 19
de 22
es 7
km-KH 6
th-TH 14
it 13
az-AZ 2
id-ID 8
fy-NL 13
fa-IR 33
af-ZA 8
ar-SA 4
cy-GB 0
gl-ES 11
ms-MY 3
ar-JO 1
es-CH 0
es-CL 6
am-HY 1
hi-IN 5
vi-VN 4
en-AU 5
cz-CZ 1
he 1
fa 1
ur 1
ja 18
fr 23
nl 9
pl 9
ru 14
sk 15
eu-EU 1
de-CH 5
ko 4
hr 1
sr-Yu 3
ga-IE 7
pt-PR 0
tr 3
cs 4
hu 7
en-BZ 3
en-CA 4
en-IE 3
en-JM 3
en-NZ 3
en-PH 3
en-TT 3
en-ZA 3
en-ZW 3
es-BO 1
es-CO 1
es-CR 1
es-DO 1
es-EC 1
es-SV 1
es-GT 1
es-HN 1
es-NI 1
es-PA 1
es-PY 1
es-PE 1
es-PR 1
es-MX 2
es-UY 1
es-VE 1
fr-BE 2
fr-CA 2
fr-CH 2
fr-LU 2
fr-MC 2
eu-ES 3
zw-TH 0
da-DA 1
be 1
eo 1
ca 7
pt 2
ar-DZ 1
jp-JP 0
et-EE 2
nl-BE 1
eu 1
en-EN 0
sr-CS 1
ua-UA 1
no-NO 1
mn-MK 0
sl-SL 2
is 2
nn-NO 1
lv-LV 0
uk-AU 1
ja-JP-mac 2
ml-IN 1
wa-BE 1
is-IS 2
ca-ES 0
sv 1
fr-fR 0
da 7
fi 2
ro 1
ar-LB 0
sr-RS 3
en-UK 2
es-US 1
de-LI 1
de-LU 1
ko-Kr 1
no 1
zh 1
bg 1
tl 1
sr 1
sq 1
sl 2
xultrisLocale 1
ca-CD 1
se-SV 1
mn 0
mk 1
pa-IN 0
ka 1
lt 1
uk 2
ar-AR 1
he-HL 0
convertLocale 1

Some locales have 0 supported extensions. This is because we are only counting the most up-to-date version of each extension, not previous versions which may have supported that locale. While doing a graph for each locale would be unwise, a much wiser choice is to break it down by language.
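Breaking it down by language is not just a matter of adding up the locale rows: an extension that ships both en-US and en-GB should count once for en. A minimal sketch of that grouping, assuming the data is available as a mapping from extension guid to its set of locales (the function name and the input shape are made up for illustration):

```python
from collections import defaultdict


def extensions_per_language(locales_by_extension):
    """Count extensions per language, counting each extension once."""
    guids_by_language = defaultdict(set)
    for guid, locales in locales_by_extension.items():
        for locale in locales:
            # The language is the part before the first hyphen:
            # "en-US" -> "en", "ja-JP-mac" -> "ja", "fr" -> "fr".
            language = locale.split("-")[0]
            guids_by_language[language].add(guid)
    return {lang: len(guids) for lang, guids in guids_by_language.items()}
```

For example, given {"a": {"en-US", "en-GB", "fr-FR"}, "b": {"en"}, "c": {"fr-CA"}}, this yields 2 for en and 2 for fr, even though there are three English locale entries. Names without a hyphen, including bogus ones like xultrisLocale, pass through unchanged.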

So which languages are best supported?

Language Extensions supported
en 462
sv 58
it 202
de 212
pl 145
es 192
fi 66
ru 143
nl 154
pt 165
fr 225
ja 142
zh 148
ko 91
cs 94
mn 4
ro 31
sk 133
ca 64
el 44
ar 21
uk 64
sr 19
bg 29
hu 91
hr 65
da 100
nb 32
sl 27
lt 22
tr 75
he 42
mk 11
be 26
sq 9
km 6
th 14
az 2
id 8
fy 13
fa 34
af 8
cy 0
gl 11
ms 3
am 1
hi 5
vi 4
cz 1
ur 1
eu 5
ga 7
zw 0
eo 1
jp 0
et 2
ua 1
no 2
is 4
nn 1
lv 0
ml 1
wa 1
tl 1
xultrisLocale 1
se 1
pa 0
ka 1
convertLocale 1

And here is the obligatory graph for those numerically challenged by high school mathematics teachers.

[Graph: top 10 languages for 464 analyzed extensions]

So what does this lead to? First, I need to fix locales; we need to get the vast majority of them. Next, I want to profile all the extensions and not just the first 2500. And then I want to start looking at web crawlers and learning how to crawl a simple website before unleashing a monster on AMO.
