Business central blog

Working with text. Custom text encoding.

It may seem like working with text is one of the simplest things to do. However, in reality, it is quite a deep topic and not as easy as it may appear at first glance. In this article, we will cover the main approaches to working with text, as well as discuss the issue of text encoding.
I think it makes sense to distinguish an "old way" and a "modern way" of working with text in BC: Business Central has new methods that are not available in Navision. Sometimes they duplicate the old ones, but often they behave differently.
Let's say we need to copy the first 20 characters from a text. This can be done in several ways, for example:

local procedure TestCopyText()
var
    SomeText: Text;
begin
    SomeText := 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed id luctus sapien.';

    //Strange old way
    Message(DelStr(SomeText, 21));

    //Old way
    Message(CopyStr(SomeText, 1, 20));

    //Strange modern way
    Message(SomeText.Remove(21));

    //Modern way
    Message(SomeText.Substring(1, 20));
end;
We will not dwell on each text method, as they are all described in detail in the documentation. Instead, I'd like to highlight the new methods I like the most: Contains, Replace, and Split. These methods are quite easy to use and very useful:

local procedure TestModernMethods()
var
    SentenceList: List of [Text];
    Sentence: Text;
    SomeText: Text;
    SearchText: Text;
begin
    SomeText := 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed id luctus sapien.';
    SearchText := 'sit';

    //Search for substring
    if SomeText.Contains(SearchText) then
        Message('Text contains %1', SearchText);

    //Remove spaces in string
    Message('Text with removed spaces:\ %1', SomeText.Replace(' ', ''));
    //Replace "a" symbol with "|"
    Message('Text with replaced "a" to "|" symbol:\ %1', SomeText.Replace('a', '|'));

    //Split string into sentences based on "." separator
    SentenceList := SomeText.Split('.');
    foreach Sentence in SentenceList do
        if Sentence <> '' then
            Message('Sentence is:\ %1', Sentence);
end;
In addition, the new methods are available on any text instance. This means we can call them directly on table fields, for example:

local procedure TestTableFields()
var
    Customer: Record Customer;
begin
    if not Customer.FindFirst() then
        exit;

    if Customer.Name.Contains('Adatum') then
        Message('Customer %1 contains %2 in name.', Customer."No.", 'Adatum');

    //Customer name in lowercase
    Message('Customer name in lowercase:\ %1', Customer.Name.ToLower());

    //First 5 symbols of Customer Address
    Message('First 5 symbols of Customer Address:\ %1', Customer.Address.Substring(1, 5));
end;
Important: Substring(), Remove(), and other modern methods raise runtime errors when the index falls outside the text range. For example, if you try to get a substring from position 1 to 20 but the string contains only 15 characters, a runtime error occurs.
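To illustrate, here is a minimal defensive sketch; the helper name GetSafeSubstring is my own invention, not a standard BC function. (The classic CopyStr() is, to my knowledge, more forgiving here and simply returns the characters that exist.)

```al
local procedure GetSafeSubstring(SourceText: Text; StartIndex: Integer; Length: Integer): Text
begin
    //Return empty text if the start position lies outside the string
    if StartIndex > StrLen(SourceText) then
        exit('');

    //Clamp the length so Substring() never reads past the end of the text
    if StartIndex + Length - 1 > StrLen(SourceText) then
        Length := StrLen(SourceText) - StartIndex + 1;

    exit(SourceText.Substring(StartIndex, Length));
end;
```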
In this block, we'll look at more complex tasks that developers often face when working with text. Let's start with the TextBuilder data type, which is available out of the box only in Business Central, not in Navision. With TextBuilder it is quite convenient to collect lines of text using TextBuilder.AppendLine(), which always puts a line terminator at the end.

local procedure TestTextBuilder()
var
    LineStorage: TextBuilder;
begin
    //Insert Line 1
    LineStorage.AppendLine('Lorem ipsum dolor sit amet, consectetur adipiscing elit.');
    //Insert empty Line 2
    LineStorage.AppendLine();
    //Insert Line 3
    LineStorage.AppendLine('Sed id luctus sapien.');

    //Result text will contain line breaks, but Message may not show it in UI
    Message(LineStorage.ToText());
end;
Let's say we have a text file that consists of several lines of data, with columns separated by the ";" symbol. One correct way to parse it is to split the text into lines by the CRLF separator and then split each line by the column separator. To get the CRLF separator I use the TypeHelper codeunit, which has a huge number of useful features; I suggest you pay attention to it.

local procedure TestTextParse()
var
    TypeHelper: Codeunit "Type Helper";
    LineStorage: TextBuilder;
    ParsedResult: Dictionary of [Integer, Text];
    LineList: List of [Text];
    LineValue: Text;
    ColumnValue: Text;
    i: Integer;
begin
    //Emulate text file with line breaks and ";" column separator
    LineStorage.AppendLine('No_;Name;Balance;Description');
    LineStorage.AppendLine('1000;Adatum;231.2;Some Description');
    LineStorage.AppendLine('2000;Test;0;It is test');

    LineList := LineStorage.ToText().Split(TypeHelper.CRLFSeparator());
    //Read data line by line
    foreach LineValue in LineList do
        //Read column by ";"
        foreach ColumnValue in LineValue.Split(';') do begin
            //Just a demonstration; this can be whatever you need to do with lines and columns
            i += 1;
            ParsedResult.Add(i, ColumnValue);
        end;
end;
What if the text data is in a BLOB field? How do we correctly read this text in Business Central? To read correctly from BLOB/Media, we need to know the encoding in which the text was written. In addition, we need to remember that the text may contain line breaks (CRLF separators), so we need a loop to make sure the data is read completely.

local procedure TestReadTextFromBLOB()
var
    SalesHeader: Record "Sales Header";
    TypeHelper: Codeunit "Type Helper";
    TextInStream: InStream;
    Result: Text;
    Line: Text;
begin
    SalesHeader.SetRange("Document Type", SalesHeader."Document Type"::Order);
    if not SalesHeader.FindFirst() then
        exit;

    //Initialize InStream from BLOB, read with same TextEncoding as it was recorded
    SalesHeader."Work Description".CreateInStream(TextInStream, TextEncoding::UTF8);
    //Loop over each line in the InStream; yes, a BLOB may contain line breaks
    while not TextInStream.EOS() do begin
        //Read and store each line to variable
        TextInStream.ReadText(Line);
        //Accumulate result and restore line break to show same data as it was in BLOB
        //Be careful with CRLF separator, this logic will always add additional line break
        Result += Line + TypeHelper.CRLFSeparator();
    end;

    //Result text will contain line breaks, but Message may not show it in UI
    Message(Result);
end;
Now let's consider the issue of text encoding, which is quite a broad topic and one that can cause a number of problems when working with text. Simplistically speaking, each text encoding can be thought of as a table that maps a certain number to a specific glyph. Different encodings have different mapping tables, so the same number may be decoded into different symbols depending on the encoding used. This also works in the opposite direction: the same symbol can be encoded into different numbers depending on the encoding table. Sometimes several numbers are decoded as one symbol by one encoding, while another code page shows them as several separate symbols. To check the supported code page identifiers for Windows, see the documentation.
That is the reason for the appearance of an incomprehensible mixture of characters when reading or writing data. If we make a mistake with the encoding, we will see fully or partially corrupted data. Therefore, when working with text, it is important to read the text in the same encoding in which it was previously written.
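A small sketch of such a mismatch: we write text as UTF-8 and read the same bytes back as the Windows (ANSI) encoding. The exact garbled output depends on the server's Windows code page, so treat the comment below as an illustration rather than a guaranteed result.

```al
local procedure TestEncodingMismatch()
var
    TempBlob: Codeunit "Temp Blob";
    TextInStream: InStream;
    TextOutStream: OutStream;
    Line: Text;
begin
    //Write text as UTF-8: non-ASCII symbols become multi-byte sequences
    TempBlob.CreateOutStream(TextOutStream, TextEncoding::UTF8);
    TextOutStream.WriteText('Déjà vu');

    //Read the very same bytes back, but decode them as Windows (ANSI)
    TempBlob.CreateInStream(TextInStream, TextEncoding::Windows);
    TextInStream.ReadText(Line);

    //Each multi-byte UTF-8 sequence is decoded as separate ANSI characters,
    //so on a Windows-1252 system this shows mojibake like "DÃ©jÃ  vu"
    Message(Line);
end;
```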
The most complete and comprehensive text encoding standard is Unicode. It is the predominant encoding used in applications and websites worldwide, and it is strongly recommended for encoding text. If you use the Windows encoding in Business Central for writing, for example, the result may differ depending on the Windows localization settings.
This also sheds some light on the difference between OutStream.Write() and OutStream.WriteText(), or InStream.Read() and InStream.ReadText(). If we want to write bytes directly without the context of encodings, we need to use Read()/Write(). For example, if we want to write any binary file such as .doc/.pdf/etc. But if we are working with text, then we need to use ReadText()/WriteText() with the correct combination of text encoding.
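For binary payloads, one common pattern is to avoid ReadText()/WriteText() entirely and move the raw bytes with CopyStream. A minimal sketch (the procedure name CopyBinaryBlob is my own):

```al
local procedure CopyBinaryBlob(var SourceTempBlob: Codeunit "Temp Blob"; var TargetTempBlob: Codeunit "Temp Blob")
var
    SourceInStream: InStream;
    TargetOutStream: OutStream;
begin
    //No TextEncoding parameter: the content is treated as opaque bytes (.doc/.pdf/etc.)
    SourceTempBlob.CreateInStream(SourceInStream);
    TargetTempBlob.CreateOutStream(TargetOutStream);

    //CopyStream moves the bytes unchanged, so no encoding conversion can corrupt the file
    CopyStream(TargetOutStream, SourceInStream);
end;
```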
Business Central supports a limited number of text encodings, specifically: MSDos, UTF8, UTF16, and Windows. Here is an example with reading/writing in standard BC text encodings:

local procedure ExampleTextDataReadWrite()
var
    TempBlob: Codeunit "Temp Blob";
    TypeHelper: Codeunit "Type Helper";
    TextInStream: InStream;
    TextOutStream: OutStream;
    TextData, Line : Text;
begin
    TempBlob.CreateOutStream(TextOutStream, TextEncoding::UTF8);

    TextOutStream.WriteText('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed id luctus sapien.');

    TempBlob.CreateInStream(TextInStream, TextEncoding::UTF8);

    while not TextInStream.EOS() do begin
        TextInStream.ReadText(Line);
        TextData += Line + TypeHelper.CRLFSeparator();
    end;

    Message(TextData);
end;
But what should we do if we still encounter a specific encoding that we cannot avoid? Perhaps a third-party application can only work with the Japanese Shift_JIS encoding, for example. If transitioning to Unicode is not possible, there are several ways to solve this problem in Business Central. I will show one of the options that I use; in this case we will work with binary data instead of text.

local procedure ExampleShiftJIS()
var
    FileInStream: InStream;
    TextData, FileName : Text;
begin
    TextData := 'クケコサシクケコサ';
    WriteTextInCustomEncoding(TextData, 932).CreateInStream(FileInStream);
    FileName := 'test.txt';
    DownloadFromStream(FileInStream, '', '', '', FileName);
end;

procedure WriteTextInCustomEncoding(TextData: Text; CodePage: Integer) TempBlob: Codeunit "Temp Blob"
var
    DotNetEncoding: Codeunit DotNet_Encoding;
    DotNetStreamWriter: Codeunit DotNet_StreamWriter;
    FileOutStream: OutStream;
begin
    TempBlob.CreateOutStream(FileOutStream);

    DotNetEncoding.Encoding(CodePage);
    DotNetStreamWriter.StreamWriter(FileOutStream, DotNetEncoding);
    DotNetStreamWriter.Write(TextData);
    DotNetStreamWriter.Dispose();
end;
The WriteTextInCustomEncoding() method writes TextData to a TempBlob using a custom code page. In our example, ExampleShiftJIS() writes some Japanese symbols in the Shift_JIS (932) encoding, and the result is downloaded from BC as a text file.
The main idea of this method is to write the correct bytes (numbers) based on the input code page. Essentially, the DotNetStreamWriter looks up, for each glyph of the text, the bytes defined by the given code page and writes them to the stream. As a result, we will get a file like this:
At first glance, it looks incorrect, but if we compare the numbers against the Shift_JIS glyph table, we will find that the bytes are correct. So why don't we see the correct text? It's very simple: text editors cannot predict with 100% certainty which encoding should be applied to an opened file. Often the file is displayed in some default encoding; in this case, the screenshot shows that it is Windows-1252. All we need to do is switch the text editor to the correct encoding. You will then notice that the bytes are the same, but because of the change of code page the glyphs are decoded correctly:
What happens if we try to generate a file by writing Japanese glyphs in the Unicode standard, specifically in utf-16?

local procedure ExampleUnicodeUTF16()
var
    FileInStream: InStream;
    TextData, FileName : Text;
begin
    TextData := 'クケコサシクケコサ';
    WriteTextInCustomEncoding(TextData, 1200).CreateInStream(FileInStream);
    FileName := 'test.txt';
    DownloadFromStream(FileInStream, '', '', '', FileName);
end;
We will suddenly discover that the file opens in the correct encoding and the glyphs are correct, but the bytes are different. What is the reason for this? Well, first of all, we already know that the same glyph can be encoded as different numbers or combinations of them, so the new mapping table has its own codes for the same glyphs. One question remains: why did the file open in the correct encoding? It's all because of the byte order mark (BOM), a short byte sequence placed at the beginning of the file that tells the text editor which encoding to use. In our case it's FF FE, the standard marker for UTF-16 LE (code page 1200).
To read binary data in a custom text encoding, we can use similar DotNet wrapper codeunits in Business Central:

local procedure ExampleReadShiftJIS()
var
    TextData: Text;
    FileInStream: InStream;
begin
    TextData := 'クケコサシクケコサ';
    WriteTextInCustomEncoding(TextData, 932).CreateInStream(FileInStream);

    Message(ReadTextInCustomEncoding(FileInStream, 932));
end;

procedure ReadTextInCustomEncoding(var TextDataAsInStream: InStream; CodePage: Integer) Result: Text
var
    DotNetEncoding: Codeunit DotNet_Encoding;
    DotNetStreamReader: Codeunit DotNet_StreamReader;
begin
    DotNetEncoding.Encoding(CodePage);
    DotNetStreamReader.StreamReader(TextDataAsInStream, DotNetEncoding);
    Result := DotNetStreamReader.ReadToEnd();
    DotNetStreamReader.Dispose();
end;
This is just the tip of the iceberg; as you can see, dealing with text is not that simple. But we have covered the main methods and approaches to working with text, including the modern text instance methods. We have also reminded ourselves of the difference between binary and text data, what encoding is, and how to deal with it. I hope this information was useful to you. See you soon!
May 11, 2023