Problem with umlauts in SOTAWatch

Recently I have been experiencing some problems with character encoding on the SOTAWatch summit pages. Many German summit names contain umlauts (äöü) and the sharp s (ß). Instead of these characters I now see a placeholder, a question mark in a diamond. The pages seem to be set to the UTF-8 character encoding. If I manually switch to Western ISO-8859-1, Western ISO-8859-15 or some other encoding, everything is OK - of course only until I reload the page, which switches back to UTF-8. I am using Firefox 23.0.1 on Windows XP 32-bit. I also tried Internet Explorer 8.0 - same problem.
Other web pages, including the SOTA database, are fine even with UTF-8.
Edit: Even the Reflector, i.e. this posting, is fine with UTF-8!

Does anyone else experience such problems?

73 de Michael, DB7MM

In reply to DB7MM:

Same here with Firefox 23.0.1 and Win7.

Test: äöüÄÖÜß end of test

Heinz

In reply to DB7MM:

Servus Michael,

Same here - I have a suspicion that the encoding may not have been set correctly for the database, and/or for the database connection being made in the background for the pages in question.

If UTF-8 is to be used in a web-based system, then ALL components of the system should be using UTF-8 - all except source-code, that is :wink:

Rob DM1CM

I reported this years ago. The problem is that the pages declare themselves to be encoded in UTF-8 (as of course they should be) but the text in question is actually encoded in Windows code page 1252. I think Jon did try a few changes but did not get to the bottom of the matter.

The database information itself is mostly fine. The CSV export from the database has correct UTF-8 (apart from the Ukrainian names, which seem to have been irretrievably garbled long ago). Something is going wrong between the main database and SOTAwatch.

Thanks all for the confirmations.

In reply to M1MAJ:

I reported this years ago.
Maybe this was broken at some time in the past, but it was OK for a long time and now the problem is occurring again.

Something is going wrong between the main database and SOTAwatch.
Exactly. Native data like the Reflector are OK, but all data coming from the database are not correctly encoded in UTF-8.

73 de Michael, DB7MM

In reply to DB7MM:

There’s no such thing as “native data” :wink: - I would suspect the reflector data are also stored in the database somewhere.

It would seem that the problem lies in the way data are stored in, and then passed on by, MS SQL Server to the C# routines used by SOTA. If summit name data are stored as NVARCHAR rather than as VARCHAR in SQL Server, then it’s far easier to do the UTF-8 conversion in C#/.NET. If they are stored as VARCHAR, however, then it’s possible to have the db return binary data (using something like “SELECT CONVERT(varbinary(20), SummitName) AS BinSummitName FROM SOTASummits”) which can then be converted to UTF-8 with something like System.Text.UTF8Encoding.GetString().
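
Something like this, perhaps - a quick, untested sketch. The table/column names are the ones from my example above, and the connection string is just an assumption:

    // Untested sketch: fetch the raw bytes of a VARCHAR column and decode
    // them as UTF-8 on the C# side. Connection string is invented.
    using System;
    using System.Data.SqlClient;
    using System.Text;

    class SummitNameDump
    {
        static void Main()
        {
            using (var conn = new SqlConnection("Server=.;Database=SOTA;Integrated Security=true"))
            using (var cmd = new SqlCommand(
                "SELECT CONVERT(varbinary(80), SummitName) FROM SOTASummits", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        var raw = (byte[])reader[0];                     // bytes exactly as stored
                        Console.WriteLine(Encoding.UTF8.GetString(raw)); // decode as UTF-8
                    }
                }
            }
        }
    }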

Yuk! - it’s all MUCH easier to do with PHP/MySQL - simply issue a “SET NAMES UTF8” command when opening a DB connection.

Just my $0.02 worth…

Rob

In reply to DB7MM:

Sorry about this problem. Some responses:

  • This seems to have started a couple of weeks ago. DB7MM was correct: I got all the encoding ducks in a row quite a while back, and all was fine for a long time.

  • I’m not sure now what the problem is and haven’t had time over the last couple of weeks to do anything more than a quick investigation. Obviously something has changed.

  • Nothing in the code has changed. But it is possible that a routine update to system software on the Linux box I am running this on may have altered something. I am assuming nothing has changed to the incoming data from the database.

  • The reflector data is stored locally in a MySQL database whereas the Summit data is imported from the database.

  • I hope to get some time soon to look into it more thoroughly, and meanwhile I have asked a friend to also take a look.

Again, sorry for the inconvenience. It’s unfortunately caught me in the middle of a particularly busy time at work.

Hope to have it fixed before too long.

73, Jon

In reply to GM4ZFZ:

  • Nothing in the code has changed.

I believe this. The pages being delivered from the server show exactly the same problem as they always did. The characters being complained of are encoded in Windows Code Page 1252 rather than UTF-8. I have just downloaded a sample page “raw” (using “curl”) and it is the same now as it has been for years. For example, the “sharp-s” is encoded as the single byte 0xDF.

Of course each such byte is an invalid UTF-8 sequence. A single isolated byte with the top bit set does not encode anything in UTF-8 - in a multi-byte sequence there must always be at least two consecutive bytes with the top bit set. The pages as delivered cannot be decoded as UTF-8.

It always surprised me that despite this defect, by and large the pages did display correctly. I could only explain this by assuming that the browser did not diagnose the invalid UTF-8 sequence, but let it pass through to be interpreted as a CP1252 encoding instead. This is exactly the fudge that I had to adopt when decoding the pages for the Twitter feed.

The correct treatment of an invalid UTF-8 sequence is either to abort the entire operation (which in this case would be unhelpful) or to substitute a replacement character (U+FFFD, the question mark in a diamond) indicating that the byte sequence was in error. The latter seems to be exactly what is happening now.
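
To illustrate with a minimal C# sketch (any environment with both encodings to hand would show the same thing):

    using System;
    using System.Text;

    class InvalidUtf8Demo
    {
        static void Main()
        {
            byte[] raw = { 0xDF }; // "ß" in CP1252; invalid on its own in UTF-8

            // Decoded as Windows-1252: the sharp-s appears - the old "helpful" fudge.
            Console.WriteLine(Encoding.GetEncoding(1252).GetString(raw)); // ß

            // Decoded as UTF-8: .NET substitutes U+FFFD, the question mark
            // in a diamond that started this thread.
            Console.WriteLine(Encoding.UTF8.GetString(raw)); // �
        }
    }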

I am therefore forced to the conclusion that it is browser behaviour that has changed. Previously the treatment of the error was superficially helpful but wrong; it is now correct.

In summary, I am convinced that you should be looking for a long-standing problem, not a new one.

In reply to M1MAJ:

Martyn is entirely correct in what he says here, in particular with the way in which user-agents (in this case, browsers) handle, or have handled, invalid UTF-8 sequences.

The actual problem lies somewhere deep in the system, in a place or subsystem which does not understand, or has no provision for, UTF-8. My suspicion is that it lies in the fact that the SQL Server database system has NO support for UTF-8 collation, whether at the field, table, or database level. This is in contrast to some other database systems currently available: Oracle, PostgreSQL, MySQL… We can see that the Reflector data coming from the MySQL database are being handled correctly, whereas those coming from “the database” - i.e. the SOTA system comprising the MS SQL Server database + application - are not.

The best SQL Server can do is to store text data in the NVARCHAR data type, where the data are actually stored in UTF-16 (or, more accurately, UCS-2) format. It is then a matter for the systems accessing these data (data-access objects in programming languages, for example) to handle any required conversion to UTF-8 properly. Now, .NET is SOMEWHAT capable of doing this; PHP is CERTAINLY able to do this. Time for a change in systems?
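
By way of a sketch (not the actual SOTA code - the connection string is invented, and SummitName is assumed to be NVARCHAR here):

    using System;
    using System.Data.SqlClient;
    using System.Text;

    class NvarcharDemo
    {
        static void Main()
        {
            using (var conn = new SqlConnection("Server=.;Database=SOTA;Integrated Security=true"))
            using (var cmd = new SqlCommand("SELECT SummitName FROM SOTASummits", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // NVARCHAR arrives as a UTF-16 .NET string, so UTF-8
                        // output for the page is one explicit, lossless call away.
                        string name = reader.GetString(0);
                        byte[] utf8 = Encoding.UTF8.GetBytes(name);
                        Console.WriteLine(name);
                    }
                }
            }
        }
    }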

My $0.03 worth…

Rob

In reply to DM1CM:

Oh what a tangled web we weave…

I have a simple solution that is somewhat UK-centric but may not be so well received. W. Shakespeare, Esq. wrote some stunning prose using only 24 letters. If he could do that, you’d think we could represent some hill names using the same 24 (26) letters without the need for squiggles and things.

No, I didn’t think there’d be much support for that! :slight_smile: Right, serious time.

It’s a long time since I’ve looked at encoding schemes. But here are some observations of what we have.

  1. Ukrainian summits using Cyrillic (I think) are broken.
  2. Looking at the database on my local Win7 machine with SQL Server Management Studio shows umlauts and things rendering correctly.
  3. Looking at the database web server output on FF 23.0.1 Win7 shows umlauts etc. rendering OK.
  4. Data for names is stored in Varchar types.
  5. Encoding on the database hasn’t changed ever.
  6. UTF-8 encoding is selected for generating the CSV files.

This is what we have. I’ve not looked at what is coming back from the database as a raw byte stream, so the correct umlaut etc. display I see may well be the browser/SQL Server Management Studio fixing broken encodings.

The only encodings we can’t show are Ukrainian & Korean. They do need fixing so that the data is stored correctly and then rendered correctly. This is important for our users who have limited Roman-charset experience, and because we will be adding more associations that don’t use Roman alphabets. However, it’s not the highest item on the priority list at present.

Andy
MM0FMF

In reply to MM0FMF:

How about O-slash (Ø) for zeros, Andy, as in MMØFMF.

Elliott, K6EL
troublemaker ØØ1

In reply to K6ILM:

How about O-slash (Ø) for zeros, Andy, as in MMØFMF.

You shouldn’t really do that. The character is a zero, and should always be encoded as such. If you want it to appear with a slash to distinguish it from the letter O, you should choose a FONT in which zero looks like that. It’s a horrible hack to make it a different character just to make it look different.

In reply to M1MAJ:

Tee hee. My 0 and O do look different!

Andy
MM0FMF

In reply to MM0FMF:

It’s a long time since I’ve looked at encoding schemes.

There is an infamous exam question that was set in the University of Cambridge Computer Science examination, sometime in the 1970s I think, which reads, in its entirety:

“Why do intelligent people designing operating systems get into apparently ridiculous messes about character codes?”

The candidate was expected to write a half hour essay.

Of course the world was very different then and the answer being sought then was rather different from the one a student might give now. But somehow, not a lot has changed, even though character encoding is by and large a solved problem.

In reply to MM0FMF:

  1. Data for names is stored in Varchar types.

That does present a problem for supporting a richer character set than we already have. But this should not prevent support of the existing characters in SOTAwatch.

The summits.csv file is an existence proof of a mechanism to export the database data encoded in Unicode/UTF-8, even if you can’t currently support the full Unicode character set.

We (the users) don’t know how the data gets from the main database to SOTAwatch, but there is circumstantial evidence that the mechanism uses Windows native encoding. If this transfer could be changed to use UTF-8, it might solve the current display problem AND pave the way to a transparent upgrade to full character set support (i.e. NVARCHAR in the database) at a later date when time permits.

In reply to GM4ZFZ:

Hope to have it fixed before too long.

It seems you have fixed it already. I just had a look at the DM/BM summit list and everything is perfect. Thanks for the quick fix.

73 de Michael, DB7MM

In reply to M1MAJ:

That does present a problem for supporting a richer character set than we already have.

It’s not too hard to enhance it. A thinking-on-the-fly solution is to use the current VARCHAR field to hold Roman-alphabet names and a new table for the languages which don’t use them. A bit kludgy, but it would work, and we only need extra space for a few languages rather than all of them.

Andy
MM0FMF

In reply to MM0FMF:

The only encodings we can’t show are Ukrainian & Korean.

Actually, I have realised that if the internal encoding in the database is Windows-1252, you can’t do Polish properly. The letters Ł and ł are missing. So, to pick an example at random, the database has the name of SP/BZ-006 wrong relative to the ARM.

Rob DM1CM pointed out that MSSQL has no support for UTF-8 in VARCHAR fields, and I have no reason to doubt that. However, it seems that what is actually missing is collation. But do we need that? Is there anywhere in any of the database queries that needs ORDER BY on the summit name column? I cannot imagine that there would be.

If that’s right, I cannot see any reason not to store UTF-8 data in a VARCHAR field. It’s just bytes in and bytes out, and provided you don’t do any automatic conversions, the entire dataflow could be UTF-8 throughout. The database doesn’t care what the bytes mean (as long as you don’t need to sort). Obviously the initial one-off conversion of the existing data would be a pain, and the correction of the data errors would be an even bigger job (but could be done at leisure as time permits).
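
To make the “bytes in” half concrete, a hypothetical sketch (untested; the SummitCode column, the summit code and the connection string are invented, the rest follows Rob’s earlier example). The trick is to ship the bytes as VARBINARY so the client driver never gets a chance to apply a codepage conversion - CONVERT between varbinary and varchar in SQL Server copies the bytes verbatim:

    using System.Data;
    using System.Data.SqlClient;
    using System.Text;

    class Utf8IntoVarchar
    {
        static void Main()
        {
            using (var conn = new SqlConnection("Server=.;Database=SOTA;Integrated Security=true"))
            using (var cmd = new SqlCommand(
                "UPDATE SOTASummits SET SummitName = CONVERT(varchar(80), @b) " +
                "WHERE SummitCode = @code", conn))
            {
                // UTF-8 bytes go in as VARBINARY; CONVERT stores them byte-for-byte,
                // so no codepage translation happens anywhere along the way.
                cmd.Parameters.Add("@b", SqlDbType.VarBinary, 80).Value =
                    Encoding.UTF8.GetBytes("Hörnle");
                cmd.Parameters.Add("@code", SqlDbType.VarChar, 16).Value = "DM/BM-001";

                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }

Reading the bytes back out again would use CONVERT(varbinary(…), SummitName), as in DM1CM’s post above.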

Might this be the way to go? It avoids making any structural change to the database schema at the cost of having to orchestrate a semantic change.

Martyn M1MAJ

In reply to M1MAJ:

" I cannot see any reason not to store UTF-8 data in a VARCHAR field" - actually, there is a very good reason why the VARCHAR data type cannot store more than a handful of UTF-8 characters - see, e.g. for starters sql server - What is the difference between varchar and nvarchar? - Stack Overflow, where the following may be found:

"An nvarchar column can store any Unicode data. A varchar column is restricted to an 8-bit codepage. Some people think that varchar should be used because it takes up less space. I believe this is not the correct answer. Codepage incompatabilities are a pain, and Unicode is the cure for codepage problems. With cheap disk and memory nowadays, there is really no reason to waste time mucking around with code pages anymore.

All modern operating systems and development platforms use Unicode internally. By using nvarchar rather than varchar, you can avoid doing encoding conversions every time you read from or write to the database. Conversions take time, and are prone to errors. And recovery from conversion errors is a non-trivial problem.

If you are interfacing with an application that uses only ASCII, I would still recommend using Unicode in the database. The OS and database collation algorithms will work better with Unicode. Unicode avoids conversion problems when interfacing with other systems. And you will be preparing for the future. And you can always validate that your data is restricted to 7-bit ASCII for whatever legacy system you’re having to maintain, even while enjoying some of the benefits of full Unicode storage."

Note the bit about the 8-bit codepage - Wikipedia has this to say about UTF-8:

“UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed “octets” in the Unicode Standard). Code points with lower numerical values (i.e. earlier code positions in the Unicode character set, which tend to occur more frequently) are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.”

So, the first 128 characters of ANY codepage will be stored correctly in a VARCHAR field - the other 128 characters, as I understand it, will be stored according to the codepage defined in the database schema. To get around all these problems, an NVARCHAR data type should be used, so that Unicode can be stored and - importantly - indexed correctly in the db.

So, VARCHAR (at least in MS SQL Server) should not be used in a DB to store multinational characters:

"1) Indexes can fail when not using the correct datatypes:
In MSSQL: When you have an index over a VARCHAR column and present it a Unicode String, MSSQL-Server does not make use of the index. The same thing happens when you present a BigInt to a indexed-column containing SmallInt. Even if the BigInt is small enough to be a SmallInt, SQL-Server is not able to use the index. The other way around you do not have this problem (when providing SmallInt or Ansi-Code to an indexed BigInt ot NVARCHAR column).

  1. Datatypes can vary betweeen different DBMS’s (DataBase Management System):
    Know that every database has slightly different datatypes and VARCHAR does not means the same everywhere. While MSSQL has VARCHAR and NVARCHAR, an Apache/Derby database has only VARCHAR and there VARCHAR is in Unicode."

Horses for courses…

73, rob
DM1CM

In reply to DM1CM:

I take your point Rob, but I think you might be missing mine (I’m not sure).

Every conceivable UTF-8 string, viewed simply as a sequence of bytes, is a valid string in an 8-bit code page such as Windows-1252 (strictly, Windows-1252 leaves a few byte values unassigned, but implementations in practice map them to control characters and round-trip them). The converse is not true, of course. It is not the same string when interpreted as an 8-bit code page, but the database cannot possibly “know” that the string you put in is actually UTF-8 by stealth. You could just as easily enter it in Morse code - it’s still just a string of 8-bit bytes. It certainly will not sort correctly, but I don’t think there is any requirement for that. Provided that you don’t let the database do any automatic character set conversions, what you put in should come out. Of course Andy would see complete junk if he looked at the tables in SQL Server Management Studio or whatever.
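
A quick C# sketch of that round trip (the summit name is just an example German name):

    using System;
    using System.Linq;
    using System.Text;

    class RoundTripDemo
    {
        static void Main()
        {
            // UTF-8 bytes disguised as a CP1252 string survive the round trip,
            // even though the disguised form displays as junk.
            byte[] utf8 = Encoding.UTF8.GetBytes("Königstuhl");
            string disguised = Encoding.GetEncoding(1252).GetString(utf8);
            byte[] back = Encoding.GetEncoding(1252).GetBytes(disguised);

            Console.WriteLine(disguised);                // KÃ¶nigstuhl - the "junk"
            Console.WriteLine(utf8.SequenceEqual(back)); // True - bytes in, bytes out
        }
    }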

Obviously it is nowhere near as good as doing it properly with NVARCHAR, but I thought it might be a relatively easy way to establish a UTF-8 workflow without the pain of changing the database schema. Clearly it is Andy’s choice whether he does this, something better, or nothing!

Martyn M1MAJ