COBOL, GnuCOBOL, and Go: part 1
This is the first in an open-ended series of posts as I explore working with COBOL and integrating it with different tools, languages, and platforms. I wanted a place to record my notes and thinking for future reference, as well as to report out any interesting findings or new code.
Motivation
Recently, my team worked on a small project that involved running some COBOL on a different platform than the one where it had originally been used. Part of this project involved handling data generated on a mainframe by another COBOL program that sat upstream of ours in the same processing pipeline.
All textual elements were EBCDIC-encoded; the remaining fields were various representations of numeric data. Our task was to process the data through the COBOL and retain its semantics, and we had known-good output from a test run against which to compare our results. The main thing we needed to do was to set up an environment to run the code.
We lacked ready access to a commercial COBOL compiler or a mainframe environment in which to familiarize ourselves with how the original code worked, other than by reading the source. We chose to stick with our regular platform (a Unix) and use the open source GnuCOBOL compiler. GnuCOBOL works by first translating COBOL source to C, then relying on the platform’s C compiler to produce object code or an executable.
Since our platform assumed ASCII-encoded byte streams for text processing, such as character-wise comparisons and sorting, we chose to preprocess the data to map EBCDIC code points to ASCII before passing it into the program for processing.
This proved to be more challenging than we initially expected. A naïve approach, to map every byte in the file from a range of EBCDIC values to ASCII using, say, iconv(1), would suffice for textual elements, but would corrupt numeric encodings. We therefore needed to be guided by the individual types of the elements encoded in the file, and make selective conversions based on them.
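To make the failure mode concrete: in the common EBCDIC code page 037, the byte 0x7C is the character ‘@’, which is 0x40 in ASCII. A COMP-3 packed-decimal field (more on these below) holding +1234567 happens to end in that byte, so a whole-file character conversion would silently rewrite it. A small Go illustration, with hypothetical values:

package main

import "fmt"

func main() {
	// A COMP-3 (packed decimal) field holding +1234567: two digits per
	// byte, with the sign in the final nibble (0xC means positive).
	packed := []byte{0x12, 0x34, 0x56, 0x7C}
	fmt.Printf("before: % x\n", packed) // 12 34 56 7c

	// A whole-file converter treats every byte as a character. In EBCDIC
	// code page 037, 0x7C is '@', which is 0x40 in ASCII, so a naive
	// byte-wise mapping would rewrite the last byte of the packed value.
	packed[3] = 0x40
	fmt.Printf("after:  % x\n", packed) // 12 34 56 40, no longer +1234567
}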
Since the input file was a straightforward serialization of the data structures defined in the COBOL source, we could use those definitions to guide the processing. Data structures in COBOL are commonly known as “copybooks”, and are similar to structs in C or records in other languages: individual fields indicate their name and data type, and are grouped hierarchically by the programmer to convey intent of use.
What are copybooks?
Copybooks are somewhat like .h files in C, in that they are intended for declaring data structures and for inclusion in potentially multiple different source files. COBOL data structures can also be declared inline in a source file. The community appears to refer to the collection of data types and variables defined in a COBOL program as a copybook regardless of whether it appears in a separate file for inclusion or as inline text.
Here’s an example copybook for a fictional program that processes jobs of some kind:
01 JOB-RESULT.
   03 JID PIC 9(8).
   03 NAME PIC X(30).
   03 TIMESTAMP.
      05 JOB-DATE.
         07 JOB-MONTH PIC 99.
         07 JOB-DAY PIC 99.
         07 JOB-YEAR PIC 9(4).
      05 JOB-TIME.
         07 JOB-HOURS PIC 99.
         07 JOB-MINUTES PIC 99.
         07 JOB-SECONDS PIC 99.
   03 JOB-ERROR PIC 9.
      88 SUCCESS VALUE 0.
      88 EXCEPT VALUE 1.
      88 TIMEOUT VALUE 2.
      88 ABORTED VALUE 3.
   03 JOB-ERROR-DETAILS PIC X(30) VALUE SPACES.
   03 JOB-TOTAL PIC S9(9)V99 COMP-3.
77 FLAG-ACCEPTING-JOBS PIC A.
   88 FLAG-YES VALUE 'Y'.
   88 FLAG-NO VALUE 'N'.
The leading two-digit element is the level, followed by the name of the field, then a picture (or pic) clause, which is akin to a type: it is a description of the shape of the data to be stored. 9(8) means a repetition of 8 digits. One perhaps surprising aspect of PIC 9 is that numbers are represented as characters in the encoding of the environment, not as binary. In other words, if the number 42 were stored in a 9(2) on my machine, that would yield 2 bytes, 0x34 and 0x32 (‘4’ and ‘2’ in ASCII), and not 0x2a in a single 8-bit byte. COBOL does support binary integer and floating-point types, but, unadorned, numbers are more like textual elements that are interpreted as having numeric qualities. X(30) means 30 alphanumeric characters. Note that these are fixed-width allocations. Indentation is not significant to the compiler; it is purely to aid readability and communicate the nested hierarchical relationship between fields. The hierarchy is established by increasing level number, which is mirrored in the indentation (but again, not required). Think embedded structs in C or Go. Later, in your procedural code, you would refer to variables with NAME OF JOB-RESULT. (The JOB- prefix on fields is not required in general, since ambiguous fields can be qualified with the OF syntax; however, several field names would otherwise conflict with reserved COBOL words: DATE, SECONDS, ERROR, etc.)
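A quick Go illustration of that character-wise storage (nothing COBOL-specific here, just the bytes involved):

package main

import "fmt"

func main() {
	// The number 42 in a PIC 9(2) field on an ASCII machine is stored as
	// the characters '4' and '2', not as the binary value 0x2a.
	field := []byte("42")
	fmt.Printf("% x\n", field) // prints: 34 32
}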
A few more items of note. Level 88 fields are like enumerations: a set of valid values attached to the field that precedes them. JOB-ERROR, with its single-digit pic clause, can be one of 0, 1, 2, or 3, which can be referenced in source code symbolically by SUCCESS, EXCEPT, etc. VALUE SPACES is like an initializer. COMP-3 describes a certain binary packed format for numbers, which typically are fixed-point, non-integral values, for storing money or fractional quantities. Level 77 is conventional for top-level variables that aren’t part of a hierarchy and don’t have nested fields themselves. A useful tip: if a field has a picture clause, it is necessarily a leaf node that represents a scalar value; it will never have children itself.
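To make the packed format concrete, here is a sketch of a COMP-3 decoder in Go, assuming the common convention of two binary-coded-decimal digits per byte with the sign in the final nibble (0xC or 0xF for positive or unsigned, 0xD for negative). The implied decimal point from a V in the picture clause occupies no storage, so a PIC S9(9)V99 COMP-3 field like JOB-TOTAL takes six bytes: eleven digit nibbles plus one sign nibble.

package main

import "fmt"

// decodeComp3 unpacks a COMP-3 (packed decimal) field into an int64.
// Any implied decimal point (the V in a picture clause) is positional
// only and is left to the caller to apply.
func decodeComp3(b []byte) (int64, error) {
	var n int64
	for i, c := range b {
		hi, lo := int64(c>>4), int64(c&0x0F)
		if hi > 9 {
			return 0, fmt.Errorf("invalid digit nibble %x at byte %d", hi, i)
		}
		n = n*10 + hi
		if i == len(b)-1 {
			// The final nibble holds the sign.
			switch lo {
			case 0xD:
				return -n, nil
			case 0xC, 0xF:
				return n, nil
			default:
				return 0, fmt.Errorf("invalid sign nibble %x", lo)
			}
		}
		if lo > 9 {
			return 0, fmt.Errorf("invalid digit nibble %x at byte %d", lo, i)
		}
		n = n*10 + lo
	}
	return 0, fmt.Errorf("empty field")
}

func main() {
	// +1234567 packed into four bytes: 0x12 0x34 0x56 0x7C.
	v, err := decodeComp3([]byte{0x12, 0x34, 0x56, 0x7C})
	fmt.Println(v, err) // 1234567 <nil>
}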
(Incidentally: COBOL source no longer need be all-caps, nor formatted into rigid columns – modern COBOL compilers support free-format syntax.)
Data produced by COBOL programs tends to be fixed-width. There are exceptions to this, which I’ll get into later. But in general, by knowing the size and offset of any field declared in a copybook, you can read individual elements from a file. We calculate the size of each field by interpreting its picture clause, along with any modifiers like COMP-3, which affect how values are encoded.
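Here is a sketch of that size calculation in Go, handling only the picture forms used in this post (the helper and its name are mine, not from any library):

package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// picSize returns the number of bytes a field occupies in a record,
// given its picture clause and whether it is COMP-3. It understands
// only the simple forms above: 9, X, A, repeat counts like 9(8), a
// leading sign S, and an implied decimal point V.
func picSize(pic string, comp3 bool) (int, error) {
	pic = strings.TrimPrefix(strings.ToUpper(pic), "S")
	re := regexp.MustCompile(`([9XAV])(?:\((\d+)\))?`)
	chars := 0
	for _, m := range re.FindAllStringSubmatch(pic, -1) {
		if m[1] == "V" {
			continue // implied decimal point: occupies no storage
		}
		n := 1
		if m[2] != "" {
			var err error
			if n, err = strconv.Atoi(m[2]); err != nil {
				return 0, err
			}
		}
		chars += n
	}
	if comp3 {
		// Two digits per byte plus a sign nibble, rounded up.
		return (chars + 2) / 2, nil
	}
	return chars, nil
}

func main() {
	fmt.Println(picSize("9(8)", false))    // 8 <nil>  (JID)
	fmt.Println(picSize("X(30)", false))   // 30 <nil> (NAME)
	fmt.Println(picSize("S9(9)V99", true)) // 6 <nil>  (JOB-TOTAL)
}

Applied to the leaf fields of JOB-RESULT above, this yields an 89-byte record: 8 + 30 + 14 + 1 + 30 + 6.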
There’s much more richness to the syntax, and to the possible clauses and qualifiers for fields in copybooks, than is expressed in this short example. For now, we know enough to make progress toward a useful conversion tool.
Our approach: copybook-driven conversion
To convert the input data file from EBCDIC to ASCII, our approach would be to have our tool read in a copybook at startup and build a map of fields, calculating for each field its size in bytes from its picture clause. Then we would stream in the data and loop through each field in order, reading in the determined number of bytes. If the field is numeric with modifiers like COMP-3, we would simply write those bytes straight to the output unmodified; they have the same byte-wise representation in either encoding. If the field is textual (i.e., containing alphanumeric characters), we would convert each character from EBCDIC to its ASCII counterpart and write that to the output.
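A minimal sketch of that loop in Go, assuming the field list has already been built from the copybook and that an EBCDIC-to-ASCII table is populated elsewhere (all of these names are hypothetical, not from our actual tool):

package main

import (
	"bufio"
	"io"
	"os"
)

// field is one leaf entry derived from the copybook: its size in bytes
// and whether its bytes are character data.
type field struct {
	size    int
	textual bool
}

// ebcdicToASCII would be a 256-entry table mapping EBCDIC code points
// to ASCII; it is left zeroed here for brevity.
var ebcdicToASCII [256]byte

// convertRecord copies one record from r to w, translating textual
// fields byte by byte and passing binary fields through untouched.
func convertRecord(r io.Reader, w io.Writer, fields []field) error {
	for _, f := range fields {
		buf := make([]byte, f.size)
		if _, err := io.ReadFull(r, buf); err != nil {
			return err
		}
		if f.textual {
			for i, b := range buf {
				buf[i] = ebcdicToASCII[b]
			}
		}
		if _, err := w.Write(buf); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The leaf fields of JOB-RESULT, in order. Display digits (plain
	// PIC 9) are character data, so they are translated along with the
	// text fields; only the COMP-3 field is copied through unmodified.
	fields := []field{
		{8, true},  // JID
		{30, true}, // NAME
		{14, true}, // TIMESTAMP leaves
		{1, true},  // JOB-ERROR
		{30, true}, // JOB-ERROR-DETAILS
		{6, false}, // JOB-TOTAL (COMP-3)
	}
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()
	for {
		if err := convertRecord(in, out, fields); err != nil {
			break // io.EOF at a record boundary ends the stream
		}
	}
}

Note that display digits are marked textual here: since unadorned PIC 9 values are stored as characters, they must be translated along with names and messages to keep their numeric meaning on an ASCII platform.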
Therefore, at the outset, our main task is to ingest copybooks into our program and make sense of their syntax. While at first blush parsing each field declaration with a regular expression might seem the most straightforward way to tackle the problem, there is enough complexity in copybooks, especially owing to all the possible optional modifiers, that we don’t want to reinvent this wheel if we don’t have to. We’ll look to incorporate an existing library or tool. Since we are using GnuCOBOL as a compiler and runtime for our COBOL code anyway, we will see if it can be adapted for our use case. GnuCOBOL is written in C, so we should be able to use the bits we need from Go, which is the language of our tool.
In the next part, I’ll get into the details of trying to access the copybook parsing functions of GnuCOBOL from Go.