SORTING DATA SETS AT USC


Introduction
USCSORT and LRGSORT Utilities
Sorting Using SAS and SPSSX
Sorting CMS Data Sets
Related Documentation

Introduction

This document describes the procedures available for sorting data sets on the USC system.  It covers batch sort utilities on MVS and interactive sort procedures on VM/CMS, and briefly examines the internal sort procedures available in SAS and SPSSX.

Two cataloged procedures on the MVS system at USC are utilities specifically designed for sorting data sets:  USCSORT and LRGSORT.  The size of the data set determines which utility should be executed.  The section "USCSORT and LRGSORT Utilities" discusses these utilities and includes a formula for determining which sort utility will be most efficient for a data set, given its number of records and its record length.

In addition to these two sort utilities, many software packages have internal sort procedures.  The section "Sorting Using SAS and SPSSX" briefly examines the procedures available in the SAS and SPSSX packages for sorting data sets.  Both the batch sort utilities and the software package sort routines on the MVS machine are based on SYNCSORT OS, a high-performance sort utility designed for use on large systems.

Procedures for sorting VM/CMS files interactively are discussed in the section "Sorting CMS Data Sets".

The section "Related Documentation" lists sources of related documentation.

This documentation was produced by the Academic Research and Data Center of USC.  Questions about its content should be addressed to a consultant at 777-6865.

USCSORT and LRGSORT Utilities

Sort utilities are useful in any situation in which the records in a data set need to be re-ordered.  For example, a user may have a tape containing student survey data alphabetically ordered by last name.  Sort utilities are an efficient means to create a new data set sequenced by Social Security Number, age, race, or any other variable contained in the data.  USCSORT and LRGSORT differ in their space allocations.  This section lists the range of space allocations appropriate for each and outlines the job setup to run these utilities.


Determining which Sort Utility is Needed

The utilities USCSORT and LRGSORT differ in their space allocations. To determine the appropriate sort utility for your data set, and the work space requirements, use the following procedure:

1. Compute N from the following formula, where

            A = number of records to be sorted
            B = average length of record, in bytes
            N = (A x B x 0.4) / 56,664

    N is the approximate number of tracks required on each of three work packs to sort your data set.


2. Use the following table to determine the appropriate sort utility based on the value of 'N'.


           Value of N               Appropriate Utility
       ------------------           -------------------
                N <  5000                 USCSORT
        5000 <= N < 13000                 LRGSORT
       13000 <= N              Consult Academic Research and Data Center

3. If you are using USCSORT and N > 500, it is recommended that you code the SPACE parameter on the EXEC statement as shown in the next section, specifying the value of 'N'.  Likewise, for LRGSORT, it is recommended that you always code the SPACE parameter on the EXEC statement, for the reasons explained in the note below.

The LRGSORT procedure includes a preliminary step that verifies that the requested space is available on the sort volumes.  The subsequent LRGSORT SORT step is not executed until the requested space becomes available.

When reviewing job output for either of the cataloged sort procedures, note that the USCSORT procedure generates only one set of condition codes.  LRGSORT job output, however, has two sets of condition codes: one for the preliminary SORVCHK step, and one for the SORT step in which the actual sort takes place.
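
The three-step procedure above can be sketched in Python as follows.  This is a minimal illustration, not part of the USC procedures: the function names are hypothetical, and rounding N up to a whole track and applying the table to the rounded value are assumptions.  The EXEC statements it returns follow the format described in the Job Set-up section.

```python
import math

BYTES_PER_TRACK = 56_664   # divisor from the formula in step 1
WORK_FACTOR = 0.4          # multiplier from the formula in step 1


def tracks_needed(num_records: int, avg_record_length: int) -> int:
    """Return N from step 1, rounded up to a whole track (assumption)."""
    return math.ceil(num_records * avg_record_length * WORK_FACTOR / BYTES_PER_TRACK)


def recommend(num_records: int, avg_record_length: int) -> str:
    """Apply the table in step 2 and the SPACE advice in step 3."""
    n = tracks_needed(num_records, avg_record_length)
    if n >= 13000:
        return "Consult Academic Research and Data Center"
    if n >= 5000:
        return f"// EXEC LRGSORT,SPACE={n}"   # SPACE is always recommended for LRGSORT
    if n > 500:
        return f"// EXEC USCSORT,SPACE={n}"   # SPACE is recommended when N > 500
    return "// EXEC USCSORT"


print(recommend(100_000, 80))      # // EXEC USCSORT
print(recommend(2_000_000, 200))   # // EXEC USCSORT,SPACE=2824
print(recommend(5_000_000, 200))   # // EXEC LRGSORT,SPACE=7060
```

For instance, 2,000,000 records of 200 bytes give N = (2,000,000 x 200 x 0.4) / 56,664, or about 2,823.7 tracks, which falls in the USCSORT range but is large enough that the SPACE parameter is recommended.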


Job Set-up

The job set-up for the two utilities is identical except for the name of the procedure to be executed; substitution of 'LRGSORT' for 'USCSORT' will automatically give the user an appropriate amount of space for sorting.  The general format of the JCL necessary to execute these utilities follows:


     JOB statement
   // EXEC sort,SPACE=tracks
   //SORT.SORTIN   DD (parameters describing input data set)
   //SORT.SORTOUT  DD (parameters describing output data set)
   //SORT.SYSIN    DD *
    SORT FIELDS=(parameters)
   /*
EXEC statement - is required; the name of the appropriate utility should be substituted for 'sort'.  The SPACE parameter is recommended for LRGSORT in all cases and for USCSORT if the value of 'N' computed in step 1 above exceeds 500. In these cases, simply replace 'tracks' with the value of 'N'.  For example,

     // EXEC USCSORT
or
     // EXEC LRGSORT,SPACE=6000
SORT.SORTIN DD statement - this JCL statement defines the data set to be sorted; it is required.  The SORTIN file must have physical sequential organization or be a member of a partitioned data set.  The parameters for defining the SORTIN data set are the same as those outlined for input data sets in the ARDC JCL documentation.
SORT.SORTOUT DD statement - this JCL statement defines the output file and is required.  The parameters for defining the SORTOUT data set are the same as those outlined for output data sets in the ARDC JCL documentation.
SORT.SYSIN DD * - this JCL statement is required.  It indicates that the SORT statement will follow.
SORT statement - the SORT statement begins in column two, is required, and specifies the manner in which the records in the data set are to be manipulated.  As many as 64 control fields may be specified, listed in order of greatest to least priority.  Each sequence of p, l, f, o describes a single control field.  The general form, in which angle brackets mark optional parameters, is:

     SORT FIELDS=(p,l,f,o,...,p,l,f,o) <,SKIPREC=n>
                         <,EQUALS>
                         <,NOEQUALS>
SORT statement parameters are described below:

FIELDS parameter - fields are referenced by position ('p'), length ('l'), data format ('f'), and order value ('o').
    Position - defines the relative location of the field, within a record, on which the sort is to be performed.  For example, if the user wishes to sort a group of survey participants by Social Security Number (SSN), and SSN begins in the 18th byte of each record, the value of 'p' is '18'.
    Length - represents the number of bytes contained in the field on which the sort is to be performed.  Continuing with the example cited above, 'l' would assume a value of '9' because SSN occupies nine bytes in each record.
    Data format code - data can be assigned 13 different format codes; the format code chosen is dependent upon the type of data to be sorted.  In our example (above) 'f' would take on a value of 'CH' indicating character format.  Other common data format codes are 'BI' for binary data and 'PD' for packed decimal data.
    Order value - indicates the sequence in which the data is to be sorted, that is, ascending ('A') or descending ('D') sort order.

The student survey example used above, if sorted in ascending order, would be coded in the following manner:


        SORT FIELDS=(18,9,CH,A)
FORMAT parameter - in cases where more than one control field is specified, but all the data format codes (f) are the same (all character or all packed decimal for example), the SORT statement may be coded with the format parameter in the following general form:

        SORT FIELDS=(p,l,o,...,p,l,o),FORMAT=f
In the example that follows, the primary sort is on the field beginning in position '18' for a length of '2', with a secondary sort on the field beginning in position '7' for a length of '4'.  Both fields are sorted in ascending order ('A'), and both contain data in character format ('CH').  Assume a record layout, different from the one in the SSN example above, in which the first sort field is 'age' and the second sort field is 'final four digits of SSN'.  The output file will be in ascending order by age and, among records of identical age, in ascending order by the final four digits of SSN.

        SORT FIELDS=(18,2,A,7,4,A),FORMAT=CH
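
The effect of a multi-field SORT statement such as the one above can be imitated in Python.  The sketch below is an illustration only and does not reproduce SYNCSORT itself: the function name and sample records are hypothetical, only character ('CH') fields are handled, and comparing keys in the EBCDIC code page 'cp037' is an assumption about the collating sequence in use.

```python
from typing import List, Tuple

Field = Tuple[int, int, str]  # (position, length, order), FORMAT=CH assumed


def sort_records(records: List[str], fields: List[Field]) -> List[str]:
    """Sort records on character control fields, most significant field first."""
    result = list(records)
    # Sort on the least significant field first.  Python's sort is stable, so
    # each later pass on a more significant field keeps the earlier order
    # among records whose keys are equal.
    for position, length, order in reversed(fields):
        start = position - 1  # SORT positions count from 1, slices from 0
        result.sort(
            key=lambda rec: rec[start:start + length].encode("cp037"),
            reverse=(order == "D"),
        )
    return result


def make_record(ssn4: str, age: str) -> str:
    """Hypothetical layout: SSN digits in bytes 7-10, age in bytes 18-19."""
    return "NAME  " + ssn4 + " " * 7 + age


records = [make_record("9001", "21"), make_record("1234", "19"), make_record("0042", "21")]

# Equivalent of SORT FIELDS=(18,2,A,7,4,A),FORMAT=CH
for rec in sort_records(records, [(18, 2, "A"), (7, 4, "A")]):
    print(rec[17:19], rec[6:10])
# 19 1234
# 21 0042
# 21 9001
```

Sorting in several stable passes, rather than building one combined key, lets each field carry its own 'A' or 'D' order value.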
SKIPREC and EQUALS/NOEQUALS parameters - SKIPREC, EQUALS, and NOEQUALS are all optional parameters.  SKIPREC=n instructs the sort to skip 'n' records before sorting the input file.  The records skipped are deleted before sorting and not included in the output file.  For example:

        SORT FIELDS=(18,9,CH,A),SKIPREC=5
This statement skips the first five records before sorting.

When specified, EQUALS acts to preserve the original order of records that contain equal control fields.  For example, using the EQUALS option on an alphabetical listing of names being sorted by ZIP code, a user would get output with alphabetical order intact within equal ZIP codes.  NOEQUALS, by contrast, does not guarantee that records with equal control fields keep their original order.  The EQUALS option decreases sort efficiency slightly and should therefore be used only when necessary.
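
The behavior of SKIPREC and EQUALS can be illustrated with a short Python sketch.  It is an analogy under stated assumptions, not the sort utility itself: the function name and the sample records (name in bytes 1-8, ZIP code in bytes 9-13) are hypothetical.  Python's built-in sort is stable, so it always behaves like EQUALS; the unspecified ordering allowed by NOEQUALS is not shown.

```python
def sort_by_field(records, position, length, skiprec=0):
    """Imitate SORT FIELDS=(position,length,CH,A),SKIPREC=skiprec,EQUALS."""
    kept = records[skiprec:]   # SKIPREC: the first `skiprec` records are dropped
    start = position - 1       # SORT positions count from 1
    # sorted() is stable: records with equal keys keep their input order,
    # which is what the EQUALS option requests.
    return sorted(kept, key=lambda rec: rec[start:start + length])


names = [
    "ADAMS   29208",
    "BAKER   29201",
    "CLARK   29208",
    "DAVIS   29201",
]

print(sort_by_field(names, 9, 5))
# ['BAKER   29201', 'DAVIS   29201', 'ADAMS   29208', 'CLARK   29208']

print(sort_by_field(names, 9, 5, skiprec=1))
# ['BAKER   29201', 'DAVIS   29201', 'CLARK   29208']
```

In the first result the names remain in alphabetical order within each ZIP code; in the second, the skipped record (ADAMS) does not appear in the output at all.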

Sorting Using SAS and SPSSX

The SAS PROC SORT procedure allocates 4 cylinders (849,960 bytes per cylinder) of sort space.  This amount should suffice for most applications.  However, if additional space is necessary, contact the Academic Research and Data Center at 777-6865 for assistance.  Consult the SAS documentation for sources of more information on SAS and PROC SORT.
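
The total sort space implied by these figures can be checked with simple arithmetic.  The sketch below uses only the numbers quoted in this document; reading the 56,664 divisor from the work-space formula as bytes per track is an inference, not something the source states.

```python
BYTES_PER_CYLINDER = 849_960   # figure quoted for PROC SORT above
SAS_SORT_CYLINDERS = 4         # cylinders allocated by PROC SORT
BYTES_PER_TRACK = 56_664       # divisor from the USCSORT/LRGSORT formula (assumed per track)

total_bytes = SAS_SORT_CYLINDERS * BYTES_PER_CYLINDER
tracks_per_cylinder = BYTES_PER_CYLINDER // BYTES_PER_TRACK

print(total_bytes)                                # 3399840
print(tracks_per_cylinder)                        # 15
print(SAS_SORT_CYLINDERS * tracks_per_cylinder)   # 60
```

Under that reading, the default PROC SORT allocation is about 3.4 million bytes, or 60 tracks.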

SPSSX uses the SORT CASES control statement to sort data.  Refer to the SPSSX documentation for assistance and sources of more information on SPSSX.

Sorting CMS Data Sets

CMS data sets can be sorted in the CMS environment or the XEDIT (editor) environment.

The general format of the SORT command for the CMS environment is


     SORT fileid1 fileid2
fileid1 is the identifier (filename filetype filemode) of the CMS file to be sorted
fileid2 is the identifier (filename filetype filemode) of the file that receives the sorted output; it must be different from 'fileid1'.
After the SORT command is entered, CMS issues the following message:
   
     DMSSRT604R  ENTER SORT FIELDS:
Respond by entering one or more pairs of numbers in the form 'xx yy'.
xx is the starting position of the sort field within each record
yy is the ending position.  For example,
   
     SORT OLD JCL A NEW JCL A
causes CMS to ask for sort fields.  In response, the following is entered

     10 20 1 5
resulting in 'OLD JCL A' being sorted on the primary field in columns 10-20 and, among records with equal primary fields, on the secondary field in columns 1-5; the sorted data set would be stored in 'NEW JCL A'.
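
The way the 'xx yy' pairs select sort fields can be sketched in Python.  This is an illustration only: the function name and sample lines are hypothetical, ascending order is assumed, and Python compares characters in its own order rather than in the collating sequence CMS uses.

```python
def cms_sort(lines, field_spec):
    """Sort lines on the column ranges given as 'xx yy' pairs, major field first."""
    numbers = [int(token) for token in field_spec.split()]
    pairs = list(zip(numbers[0::2], numbers[1::2]))   # (start, end) columns

    def key(line):
        # Columns count from 1 and the ending column is included.
        return tuple(line[start - 1:end] for start, end in pairs)

    return sorted(lines, key=key)


def make_line(code, name):
    """Hypothetical layout: code in columns 1-5, name in columns 10-20."""
    return f"{code:<5}    {name:<11}"


old = [make_line("00002", "SMITH"), make_line("00001", "SMITH"), make_line("00003", "JONES")]

for line in cms_sort(old, "10 20 1 5"):
    print(line.rstrip())
# 00003    JONES
# 00001    SMITH
# 00002    SMITH
```

The name in columns 10-20 decides the order first; the code in columns 1-5 only separates the two SMITH records.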

The general format of the SORT macro, which sorts the file currently being edited (XEDIT environment), is:


     SORT n o col1 col2 ...
n - replace with the number of lines to be sorted (starting at the current line). If 'n' is specified as an asterisk (*), lines are sorted from the current line to the end of the file.
o - replace with 'A' for ascending sort order or 'D' for descending sort order.  This operand can be omitted, as in the example below, in which case ascending order is assumed.
col1 - replace with the starting column of the primary sort field
col2 - replace with the ending column (additional pairs of numbers representing additional sort fields can be included).  For example,

     SORT * 10 20 1 5
sorts all lines from the current line to the end of the file being edited on the primary sort field in columns 10-20 and the secondary sort field in columns 1-5.
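
The operands of the SORT macro can be illustrated with a Python sketch that sorts only part of a list of lines.  It is a rough analogy, not XEDIT itself: the function name is hypothetical, the current line is passed as a 0-based index, and the single order value is applied to all sort fields, which is an assumption.

```python
def xedit_sort(lines, current, n, order, *cols):
    """Sort `n` lines starting at index `current`; n may be '*' for 'to end of file'."""
    end = len(lines) if n == "*" else min(len(lines), current + int(n))
    pairs = list(zip(cols[0::2], cols[1::2]))   # (starting column, ending column)
    block = sorted(
        lines[current:end],
        key=lambda line: tuple(line[start - 1:stop] for start, stop in pairs),
        reverse=(order == "D"),
    )
    # Lines before the current line and after the sorted range are untouched.
    return lines[:current] + block + lines[end:]


text = ["HEADER", "c", "a", "b"]

print(xedit_sort(text, 1, "*", "A", 1, 1))   # ['HEADER', 'a', 'b', 'c']
print(xedit_sort(text, 1, 2, "A", 1, 1))     # ['HEADER', 'a', 'c', 'b']
print(xedit_sort(text, 1, "*", "D", 1, 1))   # ['HEADER', 'c', 'b', 'a']
```

In each call the first line stays in place because the sort starts at the current line, not at the top of the file.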

For more information concerning sorting CMS data sets, consult the CMS User's Guide (SC19-6210) available for purchase from IBM and for reference in the CS Reference Room, Third Floor, Computer Services Building.

Related Documentation

The ARDC JCL documentation provides basic information concerning Job Control Language required for running jobs at USC.  This and other documents that cover various topics related to computer use at USC are available at ARDC Documentation and at ARS Handouts.


*SAS is the registered trademark of SAS Institute Inc., Cary, N.C. 27511, U.S.A. SAS/GRAPH and SAS/ETS are trademarks of SAS Institute Inc., Cary, N.C. 27511, U.S.A.

This page updated September 17, 1999 by Amy W. Yarbrough, Academic Research and Data Center.
Copyright © 1999, The Board of Trustees of the University of South Carolina.
URL http://www.sc.edu/ardc/docs/sort.htm