International Character Support

MM0FMF · 18 October 2014 15:18

As discussed a few days back in New Associations & Updates October 2014 I had finally figured out how to get proper non-Latin character support working on the database.

The changes to the database itself were made and to test operation the Ukrainian association summit data was reloaded. This time the names appeared in Cyrillic and English. I have now updated the database application so that you can search for summit names using non-Latin characters in addition to Latin characters.

Everything looks to be working. But as you are used to my software testing skills there may be something broken that was working. If you find something then please let me know.

Note that only Ukraine UT has been reloaded. All other associations that use characters that were not part of Windows codepage-1252 will still be wrong. Some may show best-fit characters; others will display garbage. We will re-upload these associations bit by bit over the coming months. It’s a slightly involved process with lots of places for making errors, so please be patient whilst Tom and I do this.
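A minimal sketch of why text outside codepage-1252 ends up wrong, using Python's built-in codecs rather than anything from the actual database tooling. It assumes the summit name Говерла as test input and shows two common failure modes: substitution with a stand-in character, and garbage produced when UTF-8 bytes are read back as codepage-1252. Whether the database hit either of these exact paths is an assumption.

```python
# Illustration only: forcing Cyrillic text through Windows codepage-1252.
name = "Говерла"  # summit name used as sample input

# Cyrillic letters have no cp1252 equivalent, so a strict encode fails.
try:
    name.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

# A lossy conversion has to substitute something; Python uses "?".
lossy = name.encode("cp1252", errors="replace").decode("cp1252")

# Garbage ("mojibake") appears when UTF-8 bytes are misread as cp1252.
mojibake = name.encode("utf-8").decode("cp1252")

print(representable)  # False
print(lossy)          # ???????
print(mojibake)
```

The substitution case loses the original text for good, which is why the affected associations need reloading from data files that still hold the correct characters.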

Off the top of my head the following associations need reloading: E7, HA, OK, OM, S5, SP, YO, YU, Z3. I am hoping that all the correct characters are in the data files we use to load the database. For SV and HL, there is more work as I don’t see Greek and Hangul characters in the data files. This means there will be some work for the AMs to do.

Andy, MM0FMF
obo SOTA MT

K6EL · 18 October 2014 15:46

Great. Now we can convert all zeros to oh-slash. 2E0OOO becomes 2EØOOO, just as used in all ham radio magazines in the USA.

We will all be honorary Norwegians. We might even get a medal from their king !!

Elliott, K6EL
Laden with medals

M1MAJ · 18 October 2014 18:04

No no no, please please please don’t anybody do this!

Zero is zero; changing it to Ø just to make it look different is just wrong and causes all sorts of trouble - e.g. you can no longer search properly. Crossing zeros is a useful convention in handwriting to distinguish 0 from O. It’s not necessary in most computer fonts as the two characters usually look different anyway. If you really want a “crossed zero”, use a font which has it at the code point for zero.
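To illustrate the search problem, here is a small Python sketch comparing the two spellings of the callsign from the post above. It only uses the standard `unicodedata` module; nothing here reflects how the SOTA database itself compares strings.

```python
import unicodedata

callsign = "2E0OOO"   # third character is the digit zero
slashed = "2EØOOO"    # third character is the letter O with stroke

# The two characters are different code points with different meanings.
for ch in (callsign[2], slashed[2]):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0030 DIGIT ZERO
# U+00D8 LATIN CAPITAL LETTER O WITH STROKE

# So an exact or prefix search for the real callsign no longer matches.
print(callsign == slashed)  # False
print("2E0" in slashed)     # False
```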

Martyn M1MAJ

HL5ZBA · 19 October 2014 02:42

Are you using UTF-8? If not, what encoding are you using internally?

MM0FMF · 19 October 2014 08:30

That is a very good question!

The database and all the tools are Windows based. Windows uses UTF-16LE inside the OS and in all .NET apps. MS-SQL uses UCS2 for wide characters or UTF-16 depending on the collation. I haven’t changed the collation and I don’t actually know what it is set to.

The files are UTF8 I think as I looked in the actual files saved by the database (summitslist.csv). All the Cyrillic chars are 2bytes long and are not saved to even boundaries which says the data is byte aligned not word aligned, so UTF8 not UTF16. I could be wrong here, I’m assuming nobody would be mad enough to save word data on an odd boundary.

The summit UT/CA-001 is Говерла, the first character is “Г” and that is appearing in the data as 0xD093. The Unicode definition is U+0413 so this seems correct. There is no BOM in the file which I think means it’s not UTF-16.
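The byte values can be cross-checked with a few lines of Python. This is a sketch using only the built-in codecs; it shows what “Г” looks like in UTF-8 and in both UTF-16 byte orders, so the bytes seen in the file can be matched against each.

```python
ch = "Г"  # first letter of Говерла

print(f"U+{ord(ch):04X}")            # U+0413
print(ch.encode("utf-8").hex())      # d093
print(ch.encode("utf-16-le").hex())  # 1304
print(ch.encode("utf-16-be").hex())  # 0413

# In UTF-8 each Cyrillic letter takes two bytes and ASCII takes one,
# so a two-byte character can start at an odd offset.
mixed = "a" + ch
print(len(mixed.encode("utf-8")))    # 3
```

Seeing D0 93 in the file, rather than 13 04 or 04 13, is consistent with UTF-8.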

A lot of words to say I think files are UTF8!

I guess this is because you would like to get some Hangul characters into the HL summit names. If you have access to the summit list file for HL then if you edit just the 1st ten summit names to have Hangul then Latinised in brackets like so:

and send that to me I will try it on my test system and send you the results for you to look at. If it works then we know the process needed to update the files for HL and they can be loaded during the big reload.

IZ1KSW · 19 October 2014 09:05

Thanks for the effort Andy.
Last night I was able to import the new summit list into an utf-8 mysql database without any issues.

Gab, IZ1KSW

M1MAJ · 19 October 2014 11:36

Yes they are. You would have heard from me if they weren’t

Not necessarily. You can put a BOM at the start of any Unicode file, whatever its transformation encoding. In some quarters there is a bit of a holy war about this: some folk say there should always be a BOM to help you determine the encoding, and others say that the very concept is the spawn of the devil and BOM is a spurious character which just gets in the way. The bottom line is that if there isn’t one, you may have to guess the encoding, and UTF-8 tends to be a good bet.
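A rough sketch of that “check for a BOM, otherwise guess UTF-8” approach in Python. The function names `sniff_bom` and `decode_guessing` are made up for this example, and the UTF-8 fallback is just the guess described above, not a reliable detector.

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    # UTF-32 is checked first because its little-endian BOM begins with
    # the same two bytes as the UTF-16 little-endian BOM.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_guessing(data: bytes) -> str:
    """Decode using the BOM if there is one, otherwise assume UTF-8."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        encoding = "utf-8"  # no BOM: fall back to the usual good bet
    return data[skip:].decode(encoding)

print(decode_guessing("Говерла".encode("utf-8")))                     # Говерла
print(decode_guessing(codecs.BOM_UTF16_LE + "Г".encode("utf-16-le")))  # Г
```

Stripping the BOM after detection also avoids it turning up as a stray character at the start of the text.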

UTF-8, UTF-16 and UTF-32 can all represent any legal Unicode character and you can convert between them without loss, so it doesn’t really matter which of those is used internally. UCS-2 is restricted to 16-bit characters.
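Both points can be checked in a few lines of Python. This is a sketch with an arbitrary test string; `fits_ucs2` is a hypothetical helper that simply tests whether every character fits in 16 bits.

```python
# Cyrillic, Hangul, and one character outside the 16-bit range (U+1F600).
text = "Говерла 한글 \U0001F600"

# Round-tripping through any of the three UTFs returns the same string.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    assert text.encode(encoding).decode(encoding) == text

def fits_ucs2(s: str) -> bool:
    """True if every code point fits in a single 16-bit unit."""
    return all(ord(c) <= 0xFFFF for c in s)

print(fits_ucs2("Говерла 한글"))  # True
print(fits_ucs2("\U0001F600"))    # False
```

Cyrillic and Hangul both sit within the 16-bit range, so the UCS-2 limit should not affect summit names in those scripts.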

Martyn M1MAJ

MW0WML · 19 October 2014 23:34

Back in the day I managed with a six bit Character Set on 1900 series ICL mainframes… it never did me any harm… well apart from ALWAYS HAVING TO WORK IN UPPERCASE…

73 Gerald
(now feeling very old)

HL4ZFA · 20 October 2014 04:40

This looks like good progress…however, when making the original database files I stuck the Hangeul and Hanja in different columns…would it be possible to adjust the database to handle it this way? Unless you could have more than one pair of brackets in one cell…Why wouldn’t different pieces of data be separated?

Also, are we limited to two character sets per association? In a perfect world, there’d be Hangeul, Hanja, then romanised…

73 & good work from the “give them an inch and they’ll take a mile” dept–
HL4ZFA

HL4ZFA · 20 October 2014 04:43

I’ve posted some changes with Hangeul and romanised in brackets so we can see what the effect is next time a db reload takes place.

Fingers crossed…

(pls ignore the duplicate posts I’ve deleted…still playing around with how to properly reply, quote, and reply to replies!! ^^)

MM0FMF · 20 October 2014 06:43

I’m not sure what you have done but I can’t see the file anymore. Can you check the sharing options etc.

HL4ZFA · 20 October 2014 06:49

hmmm…didn’t change anything with the sharing properties–I’m actually not the owner. Could it be simply it changed places (timewise, last edit)? If searching for “HL” in your drive page doesn’t turn it up–

What’s the address that it should be shared with? It’s currently accessed by 13 people…

MM0FMF · 20 October 2014 08:24

And mysteriously it has appeared in my list of files.

Oh, how I love Google Docs and other such cloudy wonderousness where you aren’t sure where anything lives and the UI changes hourly!

HL4ZFA · 20 October 2014 12:34

I did send you an invite, to your moose address, but I don’t know how you were accessing it before…maybe this nudged one of your ghosts that was hiding in the sheets…

DM1CM · 20 October 2014 13:18

Well, it’s certainly the case that if one saves a standalone PHP script with one of those little devils, the thing usually won’t run.

MM0FMF · 20 October 2014 22:01

Here is what the updated summits names look like for Korea:

Do they look OK?

HL4ZFA · 21 October 2014 01:15

Looks great–exactly how it should. Aside from looks ^^ how does it search? Would searching with a partial string still give us our results? IE entering only Hangeul, when there exists a Hanja equivalent also…

(I doubt anybody will be searching with Hanja, but you never know–it’s more for informational/meaning purposes…but I suppose it’d be fun to eke out all the chicken mountains, for example, and create an award for activating all of them ㅋㅋㅋ)

Now, the other big question…is there a way that’s easier to give you the data than the parenthesis notation? 2500 summits might take a while, especially if I can’t figure out the macro…
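If the Hangeul and romanised names already sit in separate columns, a short script could build the bracket notation instead of a spreadsheet macro. The sketch below is only a guess at how that might look: the column names `SummitCode`, `Hangeul` and `Romanised`, the placeholder summit codes, and the two sample rows are all made up for illustration and are not taken from the real HL summit list.

```python
import csv
import io

# Made-up sample data standing in for an exported spreadsheet.
sample = io.StringIO(
    "SummitCode,Hangeul,Romanised\n"
    "HL/XX-001,한라산,Hallasan\n"
    "HL/XX-002,지리산,Jirisan\n"
)

def combined_names(csv_file):
    """Yield (code, 'Hangeul (Romanised)') for each row of the file."""
    for row in csv.DictReader(csv_file):
        yield row["SummitCode"], f'{row["Hangeul"]} ({row["Romanised"]})'

for code, name in combined_names(sample):
    print(code, name)
```

For a real file, opening it with `open(path, encoding="utf-8", newline="")` and passing that to `combined_names` should work the same way, assuming the export is UTF-8.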

…thumbs tucked…

MM0FMF · 21 October 2014 07:34

I tried the following when testing search…

displayed UT/CR summits on the “List of All Summits” page as this has Cyrillic and Latin text
copied some Cyrillic characters from the middle of a summit name with the mouse and ctrl-c
pasted those characters into the search box on the “Find Summit” page
cheered when the correct summits were located and their details displayed.

I think search works for non-Latin characters.
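The same kind of check can be mimicked outside the database. The sketch below is a plain in-memory substring search in Python, not the database's actual search code; the list holds Говерла in the “native (Latin)” form plus one extra illustrative entry, and `find` is a hypothetical helper.

```python
# Illustrative names in "native (Latin)" form; not the real summit list.
summits = ["Говерла (Hoverla)", "Петрос (Petros)"]

def find(fragment: str):
    """Case-insensitive substring match against each summit name."""
    needle = fragment.casefold()
    return [name for name in summits if needle in name.casefold()]

print(find("верл"))   # ['Говерла (Hoverla)']
print(find("hover"))  # ['Говерла (Hoverla)']
print(find("ПЕТ"))    # ['Петрос (Petros)']
```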

DM1CM · 21 October 2014 12:24

If anybody’s searching for the South Korean summit HL/JB-212 Daegksan, then good news: it’s turned up safe and sound in:

<found> Syria! </found>

Rob

HL4ZFA · 2 February 2015 13:38

Digging up old graves here, but it reappeared to my attention that while non-Latin characters might be enjoying their heyday on the database, something’s gone awry with the sotawatch summit pages…Hangeul had formerly (2010-2014) displayed with no issues but now shows up as can be seen in the bottom link:

However, when I click edit, the Hangeul shows up:

Though, if I think I’m safe and choose “Update Link” the Hangeul disappears and is replaced with question marks “??? Map ??”. Should I click edit once again, there is no more Hangeul, just the question marks. It won’t be such an issue if the summits get their bilingual listings in the future, as that’s mainly what I was using the external links for (displaying summit names in native script).

As an interesting and totally unrelated side note, after having added my last link of the day (in an effort to catch up and string together some other local activators’ reports) I noticed the summit info count hit the magical number that many of us out there might have as their luggage lock combination… ^^

73 de HL-land…