Most of you are aware that VK7 came online at the beginning of October. For a variety of reasons we didn’t push the data to the database at the start of the month as normally happens. I’m delighted to say that the database has now been updated with the new VK7 association data. At the same time an update for OM was loaded. Thank you for your patience in waiting for this data to appear.
This was uploaded with a (marginally) different uploading tool than the one normally used. I have a feeling the names in the OM update may be a little more mangled than expected. I plead ignorance of character set issues… in my day job I leave them to other people to deal with. However, I’ll be working with Csaba YO6PIB to resolve these in our tools and process, so that we should have fewer mangled characters in future. I must admit that I suffer from the arrogance of many UK/US-based programmers: the default character sets work for me, and it’s someone else’s issue when they don’t. Working with someone who suffers the results of this mindset will be good for me and for non-Roman-character users. At the same time we’ll work on correcting those names which are already broken. Just don’t expect a fix tomorrow.
Jon will update SOTAwatch in the near future which will mean you can then spot and alert, but existing logs for these new and amended associations can be uploaded now.
Slightly OT: some of you may have noticed that many of the missing ARMs (association reference manuals) have been uploaded. Jimmy M0HGY has been running this process. We’re not there yet, but the number of outstanding updates and missing documents is falling steadily thanks to Jimmy and the AMs involved.
Regrettably, I fear you are correct - it’s quite a lot worse. I suspect many of the names are now unrecognisable. Characters which were previously substituted with something vaguely plausible (e.g. the same letter without an accent) are now just “?”.
I’m not convinced that character sets aren’t a long-standing issue for UK (as opposed to US) folk. I’ve spent the majority of my career using a character set with no representation of the pound sterling sign.
At least nowadays Unicode largely solves the problem, if only we can get into the position of using it everywhere.
One solution would be to have another column in the table giving the International Phonetic Alphabet (IPA) phonetic notation for the name. I will agree straight away that this would involve a lot of extra work and yet another character set, and pretty well all of us will be unfamiliar with this notation anyway, but at least it would make it possible for people to get close to the correct pronunciation of unfamiliar names - in the UK, for instance, those tiresomely long Gaelic names would be somewhere near correct! (Ducks for cover!)
It can’t be that hard TBH. As I said, I’ve always avoided encoding issues by letting other people handle them in my day job. The names display fine in the source files, and when the source data is extracted to a CSV they still display fine in text editors like Notepad++. They’re all using UTF-8, if I understand things correctly. The issue appears to be the uploading tool, which has support for assorted Windows codepages.
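The effect of a codepage-aware reader meeting UTF-8 data can be sketched in a few lines of Python (the actual uploader is .NET; this just illustrates the mechanism, using a summit name from the OM list):

```python
# Any UTF-8 multi-byte sequence read as cp1250 splits into two
# unrelated characters -- classic mojibake.
name = "Veľký Inovec"              # contains U+013E and U+00FD

utf8_bytes = name.encode("utf-8")        # what the CSV actually holds
mangled = utf8_bytes.decode("cp1250")    # what a cp1250 streamreader sees

# The ASCII letters survive; each accented letter becomes two
# characters of junk, so the string grows by one char per accent.
```

The same thing happens with any mismatched pair of encodings; the junk characters just differ.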
It’s merely a job of getting all the software to use the same encoding. Merely!
Looks like this is not too tricky for OM… I was having some issues getting the uploader to use UTF-8 encoding. Using the option to force it to detect byte order marks, and ensuring the downloaded file was saved as UTF-8 with BOM, results in a set of OM summit names that look good. Well, it looks good on my local test system. I’ll push them to the real database tomorrow.
Yes it would be useful, but as you say, a lot of work requiring substantial expertise.
Not at all. Once you’ve cracked the problem of getting Unicode used throughout, IPA is easy.
But taking a wider view, I think English has a far greater need of phonetic transcription than Gaelic.
Something which is perhaps more feasible would be to have separate database columns for:
Summit name in native script.
Summit name in Latin alphabet transliteration (possibly with accents, but basically the alphabet that English speakers recognise). I think most languages we’re likely to encounter have at least one generally accepted transliteration scheme.
Obviously for most of the existing summits the two will be the same (and you could represent this compatibly by having the “native” column empty). Work would be needed for associations such as South Korea, Greece and Ukraine, but it could be done gradually as time permits. In the case of Ukraine it appears that there has already been a mangled attempt to squeeze both representations into a single string, but I don’t think this is ideal even if done correctly. Keeping them separate would mean that applications could choose which to display.
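For the accented-Latin case, a plain-ASCII fallback column could even be generated automatically. A minimal Python sketch (the function name is mine; note this only strips accents from Latin letters — it does nothing useful for Cyrillic, Greek or Korean, which would still need a proper transliteration scheme chosen by the AM):

```python
import unicodedata

def ascii_fallback(name: str) -> str:
    """Strip combining accents, e.g. 'Veľký Inovec' -> 'Velky Inovec'."""
    # NFD splits each accented letter into base letter + combining mark,
    # then we drop the combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Names that are already plain ASCII pass through unchanged, so the fallback column could be populated for every association in one pass.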
The uploader already supported Windows codepages (1252 is the usual one for Western Europe, 1250 for Eastern Europe). The issue was noobness on my part in producing CSV files that were either UTF-8 or 1252 encoded, but not 1250 encoded. Feeding UTF-8 data into a 1250-encoded streamreader doesn’t produce the right answer, and feeding it 1252-encoded files doesn’t work either.
A bit of googling/MSDNing/coding got a UTF-8 uploader that would push the right data, although the flow was horrible: save the sheet from Google Docs as TSV (UTF-8 with BOM, Unix line endings), run unix2dos on the TSV, then upload with the new uploader. Everything UTF-8… the characters look good. Given that we had loaded non-Western characters before, I knew I was missing something. It turns out a bit of playing with OpenOffice was needed. A nicer flow is: download the summit list as xlsx from Google Docs, load it into OO, and save as CSV with the dialog options set to Tab separator and codepage 1250. Uploading that file with the old uploading code set to 1250 works too.
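The BOM part of this is easy to demonstrate outside .NET. Python’s `utf-8-sig` codec behaves like a BOM-detecting streamreader: it swallows a leading BOM, while plain `utf-8` leaves it sitting at the front of the first field (the row content below is just an illustration):

```python
# Simulate the Google Docs export: UTF-8 text with a leading BOM.
exported = "\ufeff" + "OM/NR-001\tVeľký Inovec\r\n"
raw = exported.encode("utf-8")      # on disk: EF BB BF + the UTF-8 text

# Plain utf-8 keeps the BOM as a stray U+FEFF character...
with_bom = raw.decode("utf-8")

# ...whereas utf-8-sig strips it, so field parsing works cleanly.
without_bom = raw.decode("utf-8-sig")
```

A parser that doesn’t strip the BOM ends up with a first summit reference of `\ufeffOM/NR-001`, which then fails any exact-match lookup.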
The pain is not quite great enough to warrant a fully automated system! Anyway, reports of issues with OM welcomed.
Latin alphabet transliteration will hit problems, too. For instance there are two common transliterations of Chinese, Pinyin and Wade-Giles, neither of which seems to work at all well. There the problem is sounds that don’t occur in the Latin alphabet, plus pitch inflexion - non-Latin sounds are of course familiar to us from things like the Welsh “ll” or Gaelic “ch”. For all that, though, you might be right about using Latin alphabet transliteration; it would be easiest for people to relate to.
Sure, you just have to pick one. The association manager can decide.
It doesn’t really handle the pronunciation of the names at all, though in many cases it will get you close to an acceptable “Anglicised” pronunciation such as the BBC pronunciation unit would now recommend for broadcast use. But that is true to some extent for pretty much every foreign language.
The main purpose of transliteration is to give people who only know the Latin alphabet something that they can recognise, compare, write down etc. At the moment we’re exclusively using transliterations for associations which do not use the Latin alphabet. I presume that if China joined next month, we’d put transliterated names into the database. My suggestion is simply to allow a native script representation as well.
Which requires the people I’m thinking about to use a foreign script.
OK, so they have to do so for callsigns anyway, so it’s not a big deal in practice. But I just thought it might be a friendly gesture to allow association managers to record the names of their hills correctly in their native language alongside the transliterated form that we currently use.
Thanks. Looks like we’re now back where we were, i.e. plausible substitutions.
Taking as an example OM/NR-001:
In the ARM, the name is “Veľký Inovec”.
Before the OM update, it appeared in the database as “Velký Inovec”.
The “bad” update mapped it to “Ve?ký Inovec”.
We’re now back to “Velký Inovec”.
The third letter here is “Latin small letter l with caron”, Unicode code point U+013E. (It looks a bit like “l” followed by an apostrophe but it isn’t - the visual appearance of the caron accent gets modified when it lands on top of the letter “l”). The code page you are using in the database has no code point for this, so your upload has to map it to an ordinary letter “l”. The “bad” upload was deciding it couldn’t handle it and replaced it with a question mark.
The fifth letter is “Latin small letter y with acute”, Unicode code point U+00FD. Your database code page does have this character, so it can be translated without substitution.
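Both behaviours can be reproduced with Python’s cp1252 codec (the uploader’s actual substitution logic may differ in detail, but the principle is the same):

```python
name = "Veľký"

# U+013E has no cp1252 code point, U+00FD does, so a '?'-substituting
# encoder mangles only the third letter -- exactly the "bad" upload.
questioned = name.encode("cp1252", errors="replace").decode("cp1252")
assert questioned == "Ve?ký"

# A strict encoder refuses outright instead of substituting,
# pointing straight at the offending character.
try:
    name.encode("cp1252", errors="strict")
except UnicodeEncodeError:
    pass  # raised for U+013E
```

The “good” behaviour — mapping ľ to a plain l — is a best-fit substitution, which .NET codepage encoders can do but which isn’t built into Python’s codecs.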
The generation of the CSV file is correctly converting your code page back to Unicode and encoding it in UTF-8. This is why it “looks right”.
So it seems that you’ve got Unicode in, and Unicode out, but a smaller encoding in the middle.
This is of course entirely consistent with your observation some while ago that MS-SQL does not natively support Unicode strings. As we discussed then, the only way round this would be to regard them as arbitrary byte sequences, so that whatever goes in comes out.
Right, thanks for that. At least we are no worse than before and have the new summits.
I had a play with something badly broken: Ukraine. The original UT files are available; online they show Cyrillic characters, and downloaded UTF-8 CSV files show Cyrillic too. When I breakpointed the uploader, I could see the filename in Cyrillic in VS2010. But they get bashed in the db. Not surprising, as the db doesn’t support Unicode as is. It is fortuitous that so much does work.
Changing the type of the summit name column can be done, and the existing data should convert OK. It will be fun to see what happens and whether the code is happy. In theory it should be: all .NET code stores String types as UTF-16LE. The uploader uses a streamreader with selectable encoding to create Strings. The fact that I can see the correct characters when debugging makes me feel positive. The code that dumps the summitlist.csv file may need some work.
In the absence of regressions we can use the “many eyes” debugging method.