As discussed a few days back in New Associations & Updates October 2014, I had finally figured out how to get proper non-Latin character support working on the database.
The changes to the database itself were made and to test operation the Ukrainian association summit data was reloaded. This time the names appeared in Cyrillic and English. I have now updated the database application so that you can search for summit names using non-Latin characters in addition to Latin characters.
Everything looks to be working. But as you are used to my software testing skills, there may be something broken that was working. If you find something then please let me know.
Note that only Ukraine UT has been reloaded. All other associations that use characters that were not part of Windows codepage-1252 will still be wrong. Some may show best-fit characters, others will display garbage. We will re-upload these associations bit by bit over the coming months. It's a slightly involved process with lots of places for making errors, so please be patient whilst Tom and I do this.
Off the top of my head the following associations need reloading: E7, HA, OK, OM, S5, SP, YO, YU, Z3. I am hoping that all the correct characters are in the data files we use to load the database. For SV and HL, there is more work as I don't see Greek and Hangul characters in the data files. This means there will be some work for the AMs to do.
No no no, please please please don't anybody do this!
Zero is zero; changing it to Ø just to make it look different is just wrong and causes all sorts of trouble - e.g. you can no longer search properly. Crossing zeros is a useful convention in handwriting to distinguish 0 from O. It's not necessary in most computer fonts as the two characters usually look different anyway. If you really want a "crossed zero", use a font which has it at the code point for zero.
The database and all the tools are Windows based. Windows uses UTF-16LE inside the OS, in all .NET apps. MS-SQL uses UCS2 for wide characters or UTF-16 depending on the collation. I haven't changed the collation and I don't actually know what it is set to.
The files are UTF8 I think, as I looked in the actual files saved by the database (summitslist.csv). All the Cyrillic chars are 2 bytes long and are not saved to even boundaries, which says the data is byte aligned not word aligned, so UTF8 not UTF16. I could be wrong here; I'm assuming nobody would be mad enough to save word data on an odd boundary.
The summit UT/CA-001 is Говерла; the first character is "Г" and that is appearing in the data as 0xD093. The Unicode definition is U+0413, whose UTF-8 encoding is the two bytes D0 93, so this seems correct. There is no BOM in the file, which I think means it's not UTF-16.
A lot of words to say I think files are UTF8!
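For anyone who wants to check the byte values for themselves, here is a minimal Python sketch. It is not part of the database tooling; the helper name `hex_bytes` is made up for illustration. It just shows what U+0413 looks like in UTF-8 and in the two UTF-16 byte orders.

```python
# Sketch only: shows how U+0413 is serialised in different encodings.
# hex_bytes is a hypothetical helper, not part of any SOTA tool.

GHE = "\u0413"  # CYRILLIC CAPITAL LETTER GHE, first letter of the summit name
NAME = "\u0413\u043e\u0432\u0435\u0440\u043b\u0430"  # the summit name in Cyrillic


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as an upper-case hex string."""
    return text.encode(encoding).hex().upper()


if __name__ == "__main__":
    print(hex_bytes(GHE, "utf-8"))      # D093
    print(hex_bytes(GHE, "utf-16-le"))  # 1304
    print(hex_bytes(GHE, "utf-16-be"))  # 0413
    # Every letter of the name takes two bytes in UTF-8.
    print(len(NAME), len(NAME.encode("utf-8")))  # 7 14
```

Seeing D0 93 in the file rather than 13 04 or 04 13 is consistent with UTF-8.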
I guess this is because you would like to get some Hangul characters into the HL summit names. If you have access to the summit list file for HL then if you edit just the 1st ten summit names to have Hangul then Latinised in brackets like so:
and send that to me I will try it on my test system and send you the results for you to look at. If it works then we know the process needed to update the files for HL and they can be loaded during the big reload.
Yes they are. You would have heard from me if they weren't.
Not necessarily. You can put a BOM at the start of any Unicode file, whatever its transformation encoding. In some quarters there is a bit of a holy war about this: some folk say there should always be a BOM to help you determine the encoding, and others say that the very concept is the spawn of the devil and BOM is a spurious character which just gets in the way. The bottom line is that if there isn't one, you may have to guess the encoding, and UTF-8 tends to be a good bet.
UTF-8, UTF-16 and UTF-32 can all represent any legal Unicode character and you can convert between them without loss, so it doesn't really matter which of those is used internally. UCS-2 is restricted to 16-bit characters.
Back in the day I managed with a six bit Character Set on 1900 series ICL mainframes... it never did me any harm... well apart from ALWAYS HAVING TO WORK IN UPPERCASE...
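As a rough illustration of "look for a BOM, otherwise guess UTF-8", here is a small Python sketch. The function names `sniff_bom` and `decode_guess` are my own invention, and the UTF-8 fallback is only the heuristic described above, not a guarantee.

```python
import codecs

# BOM signatures paired with a Python codec that strips the BOM on decode.
# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


def decode_guess(data: bytes) -> str:
    """Decode using the BOM if present; otherwise assume UTF-8 (a guess)."""
    return data.decode(sniff_bom(data) or "utf-8")
```

A file with no BOM makes `sniff_bom` return `None`, which is exactly the "you may have to guess" case.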
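A quick way to convince yourself of both points is the Python sketch below. The helper names are hypothetical, and the UCS-2 check simply tests whether every code point fits in 16 bits.

```python
# Sketch only: lossless round trips between the UTFs, and the UCS-2 limit.

def round_trips(text: str) -> bool:
    """True if text survives encode/decode in UTF-8, UTF-16 and UTF-32."""
    return all(
        text.encode(enc).decode(enc) == text
        for enc in ("utf-8", "utf-16", "utf-32")
    )


def fits_in_ucs2(text: str) -> bool:
    """True if every code point is at most U+FFFF (one 16-bit unit each)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    cyrillic = "\u0413\u043e\u0432\u0435\u0440\u043b\u0430"
    hangul = "\ud55c\ub77c\uc0b0"   # three Hangul syllables
    rare_cjk = "\U00020000"         # a CJK Extension B ideograph
    for sample in (cyrillic, hangul, rare_cjk):
        print(round_trips(sample), fits_in_ucs2(sample))
```

Cyrillic and Hangul syllables all sit below U+FFFF, so the UCS-2 restriction only bites for characters outside that range, such as the rarer CJK ideographs.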
This looks like good progress... however, when making the original database files I stuck the Hangeul and Hanja in different columns... would it be possible to adjust the database to handle it this way? Unless you could have more than one pair of brackets in one cell... Why wouldn't different pieces of data be separated?
Also, are we limited to two character sets per association? In a perfect world, there'd be Hangeul, Hanja, then romanised...
73 & good work from the "give them an inch and they'll take a mile" dept...
HL4ZFA
hmmm... didn't change anything with the sharing properties - I'm actually not the owner. Could it be simply it changed places (timewise, last edit)? If searching for "HL" in your drive page doesn't turn it up...
What's the address that it should be shared with? It's currently accessed by 13 people...
I did send you an invite, to your moose address, but I don't know how you were accessing it before... maybe this nudged one of your ghosts that was hiding in the sheets...
Looks great - exactly how it should. Aside from looks ^^ how does it search? Would searching with a partial string still give us our results? I.e. entering only Hangeul, when there exists a Hanja equivalent also...
(I doubt anybody will be searching with Hanja, but you never know - it's more for informational/meaning purposes... but I suppose it'd be fun to eke out all the chicken mountains, for example, and create an award for activating all of them)
Now, the other big question... is there a way that's easier to give you the data than the parenthesis notation? 2500 summits might take a while, especially if I can't figure out the macro...
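If the macro proves awkward, a short script could build the "Hangeul then Latinised in brackets" form from separate columns. The Python sketch below is only an assumption about how that might look: the column names `ref`, `hangeul` and `romanised`, the placeholder reference `HL/XX-001` and the function names are all made up, and Hanja is left out because no format for it has been agreed.

```python
import csv
import io

# Hypothetical sample: column names and the summit reference are placeholders.
SAMPLE_CSV = "ref,hangeul,romanised\nHL/XX-001,\ud55c\ub77c\uc0b0,Hallasan\nHL/XX-002,,Example\n"


def to_bracket_notation(native: str, latin: str) -> str:
    """Return 'native (latin)', or just latin when no native name is given."""
    native, latin = native.strip(), latin.strip()
    return f"{native} ({latin})" if native else latin


def convert(csv_text: str):
    """Return (ref, combined name) pairs for each row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["ref"], to_bracket_notation(row["hangeul"], row["romanised"]))
        for row in reader
    ]


if __name__ == "__main__":
    for ref, name in convert(SAMPLE_CSV):
        print(ref, name)
```

Rows with an empty Hangeul cell fall back to the romanised name alone, so a partly filled sheet can still be converted in one pass.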
Digging up old graves here, but it came back to my attention that while non-Latin characters might be enjoying their heyday on the database, something's gone awry with the sotawatch summit pages... Hangeul formerly (2010-2014) displayed with no issues but now shows up as can be seen in the bottom link:
However, when I click edit, the Hangeul shows up:
Though, if I think I'm safe and choose "Update Link", the Hangeul disappears and is replaced with question marks, "??? Map ??". Should I click edit once again, there is no more Hangeul, just the question marks. It won't be such an issue if the summits get their bilingual listings in the future, as that's mainly what I was using the external links for (displaying summit names in native script).
As an interesting and totally unrelated side note, after having added my last link of the day (in an effort to catch up and string together some other local activators' reports) I noticed the summit info count hit the magical number that many of us out there might have as their luggage lock combination... ^^