Understanding CSV files.
The problem with CSV
In theory, CSV files should be simple. They might not have the rich formatting abilities of an xls/xlsx file, but they are just text, and text is simple. They should be much better for interoperability between apps, because every app should be able to understand text, so they should be able to parse CSV files without much effort.
But in practice CSV is a poor format to share data to unknown apps, and the reason is the lack of a standards-defined CSV file format. Or actually the problem is not that there is not a standard, because there is one: https://tools.ietf.org/html/rfc4180
The problem is that nobody implements the standard. If you create CSV files which conform to the standard, most apps that can load CSV files (including Excel, Open Office and Google docs) will fail to open the files.
CSV files are what is known as a "de-facto" standard, and by de-facto we mean "Valid CSV files are the files that Excel generates". Other apps or tools will try to open and save "Excel-CSV" files, not "Standard-CSV" files, and those tools include FlexCel. FlexCel will generate Excel-compatible-CSV, so your users can open those CSV files with Excel or other tools that understand "Excel-compatible-CSV". In the same way, when reading FlexCel will assume the file is also Excel-compatible-CSV, because most CSV files out there are.
Ok, so going back to the title, what is the problem with CSV? The problem is that "Excel-compatible-CSV" is not a single standard: Different Excel versions and languages will generate different files. There are multiple CSV standards, and when you open a file you have to know which standard was used to generate it. When you create a CSV file, you must know which application will read the CSV file, and save the file in a format the target application will understand.
In the rest of this document we will discuss the differences.
The locale
Different languages use different decimal and thousands separators. In English, the decimal separator is the dot, and the thousands separator is the comma. So we would write the number "one thousand, ninety-nine point twenty-five" as "1,099.25". But in most other languages, the comma and the dot are reversed, and for example in Spanish, the same number would be written as "1.099,25".
Different languages also have different date ordering. In English, 2/1/1998 means "February 1, 1998". In many other languages it means "January 2, 1998"
Now imagine that you are reading a CSV file generated by an unknown application, and you find the number "1.234" in the file. Does this number mean "1 dot 234" or "1 thousand, 2 hundred and thirty-four"? You can't really know. If the file was created by some user with Excel in a Spanish machine, then it is 1234. If the user was in an English machine, then it is 1.234.
The same problem happens when you want to create a CSV file that other application(s) will read. Do you save "February 2, 1998" as "2/1/1998" or "1/2/1998"? You need to know who is going to read the file, and save a file they will understand.
The separator
It is there in the name: "C S V" stands for "Comma-Separated Values" So you would expect the different columns in a CSV to be separated by commas. But probably because in non-English locales the comma is used as the decimal separator, in non-English locales Excel uses the semi-colon (;) as the field separator. To avoid the confusion between "," and ";", some other applications use a tab (character 9) to separate between fields (this is also known as "TSV" or tab-separated values).
Once again, when you read a CSV file you need to know which separator was used in the file. When creating the file, you need to know which separator the application reading the CSV file will expect.
The encoding
Finally, you need to know the encoding to save the file. Excel by default uses the Windows' encoding when saving to CSV, and this means the meaning of the bytes in the CSV file vary depending on the machine reading them.
If for example a user saves a CSV in a Russian-locale machine, the file will be saved with a Windows-1251 Encoding This means that if he writes the character "Д" in the file, it will be saved as the character 196. When a user with an English locale (Windows-1252)opens the file, he will see an "Ä" instead since that is what the number 196 represents in Windows-1252.
In an ideal world, we would be using UTF-8 as the universal encoding, but we are far from living in an ideal world. In this real world, once again, you need to know who is going to read the file and save the file in the encoding they expect. When reading files, you need to know what encoding was used to create them.
Important
In Windows encodings like 1252 every byte is a valid character, so there is no way to know if the encoding is incorrect. The text you are reading might make no sense, but any combination of bytes is valid. But in UTF8 many combinations of characters are invalid, so when loading an UTF8 document, the UTF8 decoder can normally detect invalid documents.
What the decoder does when detecting invalid UTF-8 sequences depends in the OS and platform you are using: Sometimes it will change the invalid character by a "?" or a white square, but on others it might refuse to decode the full text.
If you are getting 0 rows when importing a CSV file in FlexCel, you most likely are using UTF-8 and the file is not in UTF-8. The UTF-8 decoder is detecting invalid characters and passing an empty string to FlexCel. When FlexCel loads the empty string, it will produce an empty document.
Dealing with CSV files in FlexCel
As discussed in the previous section, there are 3 things you need to know when reading and writing CSV files: The locale, the separator and the encoding.
To specify the locale of the files you are reading or writing, you need to change the FlexCel locale.
You specify the separator and the encoding when you call TExcelFile.Export or TExcelFile.Save to save the file, or when you call TExcelFile.Open or TExcelFile.Import to open the file.
Below is an example about how to open a file saved in a Spanish locale, using ";" as field separator and Win1232 as Encoding, and save the file using "," as field separator, with a USA locale, and UTF-8 encoding:
var
SaveLocale: TFlexCelFormatSettings;
xls: TXlsFile;
Win1252Encoding: TEncoding;
begin
SaveLocale := TFlexCelFormatSettings.PushThreadFormat('es-ES', TFormatSettings.Create('es-ES'));
try
xls := TXlsFile.Create(true);
try
Win1252Encoding := TEncoding.GetEncoding(1252);
try
xls.Open('test.csv', TFileFormats.Text, ';', 1, 1, nil, Win1252Encoding, true);
finally
Win1252Encoding.Free; //We must free Encodings created with GetEncoding.
end;
TFlexCelFormatSettings.SetThreadFormat('en-US', TFormatSettings.Create('en-US'));
xls.Save('result.csv', TFileFormats.Text, ',', TEncoding.UTF8);
finally
xls.Free;
end;
finally
TFlexCelFormatSettings.PopThreadFormat(SaveLocale);
end;
Important
As if it wasn't enough with all the issues discussed in this documents, CSV files can also be slow to read. This is because for every cell we read, we have to figure out the cell type. Please read the CSV section in the Performance Guide for more information.
Conclusion
It would make (a lot of) sense if we all agreed in a single standard so we would be able to generate "universal CSV files" which everyone could read or write, but that is just not the way it is.
You can use CSV files to interchange data with known applications, but using CSV as a "general" interchange format for unknown users is not a great idea. For example, don't provide a link to download a CSV file in a website unless you know all your users will use the same settings when opening that file. Providing an xlsx file instead is a much safer way to ensure everyone can read it.