In reply to M1MAJ:
" I cannot see any reason not to store UTF-8 data in a VARCHAR field" - actually, there is a very good reason why the VARCHAR data type cannot store more than a handful of UTF-8 characters - see, e.g. for starters sql server - What is the difference between varchar and nvarchar? - Stack Overflow, where the following may be found:
"An nvarchar column can store any Unicode data. A varchar column is restricted to an 8-bit codepage. Some people think that varchar should be used because it takes up less space. I believe this is not the correct answer. Codepage incompatabilities are a pain, and Unicode is the cure for codepage problems. With cheap disk and memory nowadays, there is really no reason to waste time mucking around with code pages anymore.
All modern operating systems and development platforms use Unicode internally. By using nvarchar rather than varchar, you can avoid doing encoding conversions every time you read from or write to the database. Conversions take time, and are prone to errors. And recovery from conversion errors is a non-trivial problem.
If you are interfacing with an application that uses only ASCII, I would still recommend using Unicode in the database. The OS and database collation algorithms will work better with Unicode. Unicode avoids conversion problems when interfacing with other systems. And you will be preparing for the future. And you can always validate that your data is restricted to 7-bit ASCII for whatever legacy system you’re having to maintain, even while enjoying some of the benefits of full Unicode storage.
"
Note the bit about 8-bit codepage - the Wikipedia has this to say about UTF-8:
“UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed “octets” in the Unicode Standard). Code points with lower numerical values (i.e. earlier code positions in the Unicode character set, which tend to occur more frequently) are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.”
So, the first 128 characters of ANY codepage will be stored correctly in a VARCHAR field - the other 128 characters, as I understand it, will be stored according to the codepage defined in the database schema. To get around all these problems, a NVARCHAR data type should be used, so that Unicode can be stored and - importantly - indexed correctly in the db.
So, VARCHAR (at least in MS SQL Server) should not be used in a DB to store multinational characters:
"1) Indexes can fail when not using the correct datatypes:
In MSSQL: When you have an index over a VARCHAR column and present it a Unicode String, MSSQL-Server does not make use of the index. The same thing happens when you present a BigInt to a indexed-column containing SmallInt. Even if the BigInt is small enough to be a SmallInt, SQL-Server is not able to use the index. The other way around you do not have this problem (when providing SmallInt or Ansi-Code to an indexed BigInt ot NVARCHAR column).
- Datatypes can vary betweeen different DBMS’s (DataBase Management System):
Know that every database has slightly different datatypes and VARCHAR does not means the same everywhere. While MSSQL has VARCHAR and NVARCHAR, an Apache/Derby database has only VARCHAR and there VARCHAR is in Unicode."
Horses for courses…
73, rob
DM1CM