Spreadsheets As An Intermediate Step In Data Analysis

                          by Willie Lockeretz

     When I first started using a Kaypro, I didn't expect to have much
use for a spreadsheet.  The little I knew about this type of program was
gleaned from advertisements, which always featured things like financial
planning, depreciation schedules, and profit and loss projections.  As
someone who knows nothing -- and cares even less -- about Presentation
Graphics, Bottom Lines, or Corner Offices, I didn't see that there could
be much in it for me.  It seemed that the main shtick was that this or
that spreadsheet could put me on the Fast Track, in turn letting me get
Up the Corporate Ladder, at the top of which, presumably, I would find
the Key to the Executive Bathroom.

     Well, I was wrong.  Nowadays I probably get more use out of
spreadsheets than any other single category of software.  But first I
needed some consciousness-raising.  For starters, this meant
acknowledging that no matter what the numbers represent, a tabular array
of numbers is a tabular array of numbers, and a program that lets you
manipulate such an array quickly and conveniently could be very handy
even if the numbers didn't happen to have dollar signs in front of them.
Second, I realized that the manipulations you might perform with such a
program could be very different from the operations the
dressed-for-success types in the ads were doing.  In fact, spreadsheets
could be useful even if you didn't do any manipulations with them at
all.

     That was the real breakthrough.  Spreadsheets might be great for
calculating net present value or internal rate of return, but in my work
I do statistical computations that would be impossible on a spreadsheet,
things like multiple regressions and factor analysis.  For this I need a
scientific statistics package.  But rather than ruling out spreadsheets,
I started using them to prepare the data for a statistics program that
would do the real analysis.  In other words, the fact that a spreadsheet
can't do the kinds of calculations I need is not important -- I don't
depend on it for calculations.  But as an intermediary between raw data
and statistical analysis, a spreadsheet has simplified my life
immensely.  I happen to favor SuperCalc2, which works very nicely on a
CP/M Kaypro, but the basic idea would hold for any reasonably powerful
spreadsheet.

     I often work with moderately large arrays of demographic, economic,
and agricultural data -- say, 12 numbers for each of 150 counties.  I
use SuperCalc2 for three purposes.  First, it's a convenient way
to enter the data.  Second, it can help spot large errors (like an extra
zero after a number).  Third, it's a convenient way to compute
additional variables from the raw data.  After it has done all that, the
statistics program is ready to take over.  A statistics package could
have done the first three tasks, too, but most don't do them as
conveniently.  This added convenience should not entail any new
problems.  Any good statistics program and any good spreadsheet should
have at least one format in common that permits them to be linked this
way.  That is, the spreadsheet should be able to write a file that the
statistics program can use as input.
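
     Suppose, just for illustration, that the common format is a
plain text file with one row per county and the values separated by
commas.  The sketch below (in Python, which is certainly not what
runs on a Kaypro, and with a made-up file name and made-up column
names) shows what the receiving end of the handoff amounts to:

    import csv

    # Read the file the spreadsheet wrote: a header row of variable
    # names, then one row of numbers per county.
    with open("counties.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)   # e.g. "county", "pop1970", "pop1980", ...
        rows = list(reader)

    # The first column is the county name; the rest are numbers.
    data = [[r[0]] + [float(x) for x in r[1:]] for r in rows]

    print(len(data), "counties,", len(header) - 1, "variables each")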

     I begin an analysis with a blank spreadsheet that has the right
column and row labels, but with a dashed line where each number is to
go.  Printed out, this is a very efficient form on which to copy numbers
from a source that you can't take back to the office, such as the
Census.  (A laptop would eliminate the need to write the numbers
down at all, of course.)  When you enter the data into the computer, the
spreadsheet on the screen corresponds in every visual detail to the
sheet from which you are reading the numbers -- the rows and columns are
in the right order, the headings are identical, the column widths are
the same, and so forth.  This helps eliminate a very common error --
your eye skipping to the wrong line.

     Checking for order-of-magnitude errors is easily done by asking
the spreadsheet to report the maximum and minimum value in a row or
column.  For some kinds of data, values that are way out of line are
obviously wrong.  The average age in a county is not likely to be 272 --
27.2 is more like it.  But for some variables an extreme value is not
necessarily an error.  In that case, the trick is to construct a new
variable whose range is more restricted than the original one.  For
example, the population of towns and cities might range from a few
hundred to several million.  But if you are using population data from
two different census years, the spreadsheet can quickly compute the
percentage change.  If you see a change of +923% in a decade, there is a
pretty good chance that a digit was repeated or a decimal misplaced --
exactly the kind of error that is most likely to occur with manual data
entry.  If you are clever in concocting new variables, you should be
able to detect nearly every order-of-magnitude error.  It's not a
substitute for
comparing each number to the source, of course.  But that can be very
monotonous and wearying, which means an error can slip past.  It's nice
to have a second line of defense.
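
     For the curious, that second line of defense can be spelled out
in a few lines of code.  The language (Python), the file name, the
column names, and the 200 percent cutoff below are all stand-ins; on
the spreadsheet itself the same checks are nothing more than a
maximum, a minimum, and a percentage-change formula in a spare row or
column:

    import csv

    with open("counties.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    p70 = header.index("pop1970")
    p80 = header.index("pop1980")

    # The quick sanity check: the largest and smallest value in each
    # population column.  An extra zero shows up here immediately.
    for col in (p70, p80):
        values = [float(r[col]) for r in rows]
        print(header[col], "ranges from", min(values), "to", max(values))

    # The restricted-range variable: percent change over the decade.
    # A county that "grew" 923 percent gets flagged for a second look.
    for r in rows:
        before, after = float(r[p70]), float(r[p80])
        change = 100.0 * (after - before) / before
        if abs(change) > 200.0:
            print(r[0], "changed", round(change), "percent -- check it")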

     Finally, most data analysis will require you to construct new
variables from the raw data (I'm talking now about quantities that are
of interest in their own right, not just as a way of detecting errors).
I prefer doing this on the spreadsheet rather than with the statistics
package because I can see the results before they get irrevocably stored
with all the other data.  A statistics package is a black box: you put
in the data at one end, and you get results out the other.  But you
don't see what goes on in between.  I like to keep an eye on the data as
much as possible.  With a spreadsheet you can inspect the newly created
variables, get familiar with them, fondle them, do whatever you want
with them (in private, and with their consent, of course). I prefer to
relinquish control to the statistics package only after the spreadsheet
has done everything it can.
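
     Here, too, a small sketch may help.  Say the variable of real
interest is population density, and the raw data happen to include a
land-area column (the column names and the Python are, as before,
invented for the example).  On the spreadsheet this is one formula
copied down a new column; the sketch ends by writing out a file the
statistics package can read:

    import csv

    with open("counties.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [[r[0]] + [float(x) for x in r[1:]] for r in reader]

    pop = header.index("pop1980")
    area = header.index("land_area")

    # The derived variable gets its own column, where it can be looked
    # over before anything is handed to the statistics package.
    header.append("density")
    for row in rows:
        row.append(round(row[pop] / row[area], 1))

    with open("counties_plus.csv", "w", newline="") as f:
        out = csv.writer(f)
        out.writerow(header)
        out.writerows(rows)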

     In short, I use a statistics package for the things that only a
statistics package can do.  But a spreadsheet is a much more practical
and pleasant way to do the tedious but necessary preliminaries: to enter
the raw numbers, check them, and carry out first-level processing.  As
soon as I started using SuperCalc2 this way, I found that it was a true
anomaly in the world of commercial personal computer packages -- a
program suitable for an even wider range of applications than its
advertising claimed.  It is also an anomaly in having earned the
highest tribute I have ever bestowed on anything in my overflowing box
of non-bundled software: I own it legally.

-- from The Boston Kugel, published by The Boston Kaypro Users Group
(BOSKUG) of the Boston Computer Society.