Paul Smith’s dev journal

COBOL, GnuCOBOL, and Go: part 1

April 22, 2020

This is the first in an open-ended series of posts as I explore working with COBOL and integrating it with different tools, languages, and platforms. I wanted a place to record my notes and thinking for future reference, as well as to report out any interesting findings or new code.

Motivation

Recently, my team worked on a small project that involved running some COBOL on a different platform from the one it had originally been used on. Part of the project involved handling data that had been generated on a mainframe by another COBOL program upstream in the processing pipeline that our code was part of.

All textual elements were EBCDIC-encoded. Other fields were various representations of numeric data. Our task was to process the data through the COBOL and retain its semantics. We had known-good output from a test run to compare our results against. The main thing we needed to do was to set up an environment to run the code.

We lacked ready access to a commercial COBOL compiler or a mainframe environment in which to familiarize ourselves with how the original code worked, other than by reading the source. We chose to stick with our regular platform (a Unix) and use the open source GnuCOBOL compiler. GnuCOBOL compiles COBOL source by first translating it to C, then relying on the platform’s C compiler to produce object code or an executable.

Since our platform assumed ASCII-encoded byte streams for text processing, such as character-wise comparisons and sorting, we chose to preprocess the data to map EBCDIC code points to ASCII before passing it into the program for processing.

This proved to be more challenging than we initially expected. A naïve approach of mapping every byte in the file from EBCDIC values to ASCII, using, say, iconv(1), would suffice for textual elements, but would corrupt numeric encodings. We therefore needed to be guided by the individual types of the elements encoded in the file, and make selective conversions based on them.

Since the input file was a straightforward serialization of the data structures defined in the COBOL source, we could use those definitions to guide the processing. Data structures in COBOL are commonly known as “copybooks”, and are similar to structs in C or records in other languages: individual fields indicate their name and data type, and are grouped hierarchically by the programmer to convey intent of use.

What are copybooks?

Copybooks are somewhat like .h files in C, in that they are intended for declaring data structures and for inclusion in potentially many different source files. COBOL data structures can also be declared inline in a source code file. It appears that the community refers to the collection of data types and variables defined in a COBOL program as a copybook regardless of whether it appears in a separate file for inclusion or as inline text.

Here’s an example copybook for a fictional program that processes jobs of some kind:

       01 JOB-RESULT.
           03 JID                   PIC 9(8).
           03 NAME                  PIC X(30).
           03 TIMESTAMP.
               05 JOB-DATE.
                   07 JOB-MONTH     PIC 99.
                   07 JOB-DAY       PIC 99.
                   07 JOB-YEAR      PIC 9(4).
               05 JOB-TIME.
                   07 JOB-HOURS     PIC 99.
                   07 JOB-MINUTES   PIC 99.
                   07 JOB-SECONDS   PIC 99.
           03 JOB-ERROR             PIC 9.
               88 SUCCESS           VALUE 0.
               88 EXCEPT            VALUE 1.
               88 TIMEOUT           VALUE 2.
               88 ABORTED           VALUE 3.
           03 JOB-ERROR-DETAILS     PIC X(30) VALUE SPACES.
           03 JOB-TOTAL             PIC S9(9)V99 COMP-3.
       77 FLAG-ACCEPTING-JOBS       PIC A.
           88 FLAG-YES              VALUE 'Y'.
           88 FLAG-NO               VALUE 'N'.

The leading two-digit element is the level, followed by the name of the field, then a picture (or pic) clause, which is akin to a type: it is a description of the shape of the data to be stored. 9(8) means a repetition of 8 digits. One perhaps surprising aspect of PIC 9 is that numbers are represented as characters in the encoding of the environment, not as binary. In other words, if the number 42 was stored in a 9(2) on my machine, that would yield 2 bytes, 0x34 and 0x32 (‘4’ and ‘2’ in ASCII), and not 0x2a in a single 8-bit byte. COBOL does support binary integer and floating-point types, but, unadorned, numbers are more like textual elements that are interpreted as having numeric qualities. X(30) means 30 alphanumeric characters. Note that these are fixed-width allocations.

Indentation is not significant to the compiler; it is purely to aid readability and to communicate the nested, hierarchical relationship between fields. The hierarchy is established by increasing level numbers, which are mirrored in the indentation (but again, not required). Think embedded structs in C or Go. Later, in your procedural code, you would refer to variables with NAME OF JOB-RESULT. (The JOB- prefix on fields is not required in general (ambiguous fields can be qualified with the OF syntax); however, several field names would otherwise conflict with reserved COBOL words (DATE, SECONDS, ERROR, etc.).)
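
To make the encoding point concrete, here’s a quick Go snippet (Go being the language our tool is written in) that prints the bytes you’d find for the number 42 laid out as a two-character field in an ASCII environment; the zero-padded string formatting is just a stand-in for how a display-usage PIC 9(2) stores its value:

    package main

    import "fmt"

    func main() {
        // A PIC 9(2) field holding 42 stores the characters '4' and '2',
        // zero-padded to the declared width, not the single binary byte 0x2a.
        display := fmt.Sprintf("%02d", 42)
        fmt.Printf("% X\n", []byte(display)) // prints: 34 32
    }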

A few more items of note. Level 88 fields are actually like enumerations: a set of valid values that attach to the field that precedes them. JOB-ERROR, with its single-digit pic clause, can be one of 0, 1, 2, or 3, which can be referenced in source code symbolically by SUCCESS, EXCEPT, etc. VALUE SPACES is like an initializer. COMP-3 describes a packed binary format for numbers, typically fixed-point, non-integral values, used for storing money or other fractional quantities. Level 77 is conventional for top-level variables that aren’t part of a hierarchy and don’t have nested fields themselves. A useful tip: if a field has a picture clause, it is necessarily a leaf node that represents a scalar value; it will never have children itself.
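
To get a feel for what COMP-3 looks like on disk, here is a simplified Go sketch of the packed-decimal layout. It’s my own illustration of the format, not GnuCOBOL’s implementation, and it ignores details like scaling and invalid input:

    package main

    import "fmt"

    // packComp3 encodes an integer "unscaled" value (e.g. 12345 for 123.45 with
    // two implied decimal places) into COMP-3 packed-decimal bytes with the
    // given total digit count. A simplified sketch of the layout, not GnuCOBOL's
    // own implementation: two digits per byte, with the sign in the last nibble.
    func packComp3(unscaled int64, digits int) []byte {
        sign := byte(0x0C) // 0xC = positive, 0xD = negative
        if unscaled < 0 {
            sign = 0x0D
            unscaled = -unscaled
        }
        // One nibble per digit plus a trailing sign nibble, rounded up to bytes.
        out := make([]byte, (digits+2)/2)
        out[len(out)-1] = sign
        nibble := digits // nibble index of the sign; digits occupy indexes 0..digits-1
        for i := 0; i < digits; i++ {
            d := byte(unscaled % 10)
            unscaled /= 10
            nibble--
            if nibble%2 == 0 {
                out[nibble/2] |= d << 4 // high nibble
            } else {
                out[nibble/2] |= d // low nibble
            }
        }
        return out
    }

    func main() {
        // PIC S9(9)V99 COMP-3 holds 11 digits; 123.45 is stored as unscaled 12345.
        fmt.Printf("% X\n", packComp3(12345, 11)) // prints: 00 00 00 12 34 5C
    }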

(Incidentally: COBOL source no longer need be all-caps, nor formatted into rigid columns – modern COBOL compilers support free-format syntax.)

Data produced by COBOL programs tends to be fixed-width. There are exceptions to this, which I’ll get into later. But in general, by knowing the size and offset of any field declared in a copybook, you can read individual elements from a file. We calculate the size of each field by interpreting its picture clause, along with any modifiers like COMP-3, which affect how values are encoded.
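
Here’s a rough Go sketch of that size calculation, limited to the handful of picture-clause symbols used in the example copybook; the fieldWidth function is my own simplification, and a real implementation would have to cover much more of the grammar:

    package main

    import (
        "fmt"
        "regexp"
        "strconv"
        "strings"
    )

    // fieldWidth returns the storage size in bytes of a field given its picture
    // clause and whether it is COMP-3. A deliberately simplified sketch: it only
    // understands the symbols used in the example copybook (9, X, A, S, V and
    // repeat counts), not the full picture-clause grammar.
    func fieldWidth(pic string, comp3 bool) int {
        // Expand shorthand like 9(8) into 99999999 so every symbol counts as one.
        re := regexp.MustCompile(`([9XA])\((\d+)\)`)
        expanded := re.ReplaceAllStringFunc(pic, func(m string) string {
            parts := re.FindStringSubmatch(m)
            n, _ := strconv.Atoi(parts[2])
            return strings.Repeat(parts[1], n)
        })

        width := 0
        for _, c := range expanded {
            switch c {
            case '9', 'X', 'A':
                width++ // each occupies one byte in ordinary DISPLAY usage
            case 'S', 'V':
                // the sign and implied decimal point take no storage of their own
            }
        }

        if comp3 {
            // Packed decimal: two digits per byte plus a sign nibble.
            return (width + 2) / 2
        }
        return width
    }

    func main() {
        fmt.Println(fieldWidth("9(8)", false))    // 8
        fmt.Println(fieldWidth("X(30)", false))   // 30
        fmt.Println(fieldWidth("S9(9)V99", true)) // 6
    }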

There’s much more richness to the syntax and to the possible clauses and qualifiers for fields in copybooks than is expressed in this short example. For now, we know enough to make progress toward a useful conversion tool.

Our approach: copybook-driven conversion

To convert the input data file from EBCDIC to ASCII, our approach would be to have our tool read in a copybook at startup and build a map of fields, calculating each field’s size in bytes from its picture clause. Then we would stream in the data and begin looping through each field in order. For each field, we would read in the determined number of bytes. If the field is numeric with modifiers like COMP-3, we would simply write those bytes straight to the output unmodified, since they have the same byte-wise representation in either encoding. If the field is textual (i.e., it contains alphanumeric characters), we would convert each character from EBCDIC to its ASCII counterpart and write that to the output.
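
Sketched in Go, the conversion loop might look something like the following. The Field type, its PassThrough flag, and the (mostly elided) translation table are illustrative names of my own, not from any existing library:

    package convert

    import (
        "bufio"
        "io"
    )

    // Field describes one leaf field from the copybook: its width in bytes and
    // whether its bytes pass through untouched (for example, COMP-3 packed data).
    type Field struct {
        Width       int
        PassThrough bool
    }

    // ebcdicToASCII is a 256-entry lookup table mapping EBCDIC code points to
    // their ASCII counterparts; only a couple of entries are filled in here.
    var ebcdicToASCII = [256]byte{
        0xC1: 'A', // EBCDIC 'A'
        0xF0: '0', // EBCDIC '0'
        // ... remaining entries elided ...
    }

    // convertRecords streams fixed-width records from r to w, copying
    // pass-through fields verbatim and translating textual fields byte by byte.
    func convertRecords(r io.Reader, w io.Writer, fields []Field) error {
        br := bufio.NewReader(r)
        bw := bufio.NewWriter(w)
        for {
            for _, f := range fields {
                buf := make([]byte, f.Width)
                if _, err := io.ReadFull(br, buf); err != nil {
                    if err == io.EOF {
                        // ReadFull reports io.EOF only when nothing was read,
                        // so the input is exhausted.
                        return bw.Flush()
                    }
                    return err
                }
                if !f.PassThrough {
                    for i, b := range buf {
                        buf[i] = ebcdicToASCII[b]
                    }
                }
                if _, err := bw.Write(buf); err != nil {
                    return err
                }
            }
        }
    }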

Therefore, at the outset, our main task is to ingest copybooks into our program and make sense of their syntax. At first blush, parsing each field declaration with a regular expression might seem the most straightforward way to tackle the problem, but there is enough complexity in copybooks, especially owing to all the possible optional modifiers, that we don’t want to reinvent this wheel if we don’t have to. We’ll look to incorporate an existing library or tool. Since we were already using GnuCOBOL as the compiler and runtime for our COBOL code, we’ll see if it can be adapted to our use case. GnuCOBOL is written in C, so we should be able to use the bits we need from Go, which is the language of our tool.
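
As a preview of the mechanism, here’s a minimal cgo skeleton showing how Go code can call into C. The count_level_01_items function is a toy stub compiled inline purely for illustration; it is not a GnuCOBOL API:

    package main

    /*
    #include <stdlib.h>
    #include <string.h>

    // Toy stand-in for a copybook-parsing entry point, compiled inline by cgo.
    // It merely counts occurrences of "01 "; it is NOT a GnuCOBOL function.
    static int count_level_01_items(const char *copybook) {
        int n = 0;
        const char *p = copybook;
        while ((p = strstr(p, "01 ")) != NULL) {
            n++;
            p += 3;
        }
        return n;
    }
    */
    import "C"

    import (
        "fmt"
        "unsafe"
    )

    func main() {
        src := C.CString("       01 JOB-RESULT.\n       77 FLAG-ACCEPTING-JOBS PIC A.\n")
        defer C.free(unsafe.Pointer(src))

        fmt.Println("level-01 items:", C.count_level_01_items(src)) // prints: level-01 items: 1
    }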

In the next part, I’ll get into the details of trying to access the copybook parsing functions of GnuCOBOL from Go.