This documentation describes the procedures available for sorting data sets on the USC system. It covers batch sort utilities on MVS and interactive sort procedures on VM/CMS, and briefly examines the internal sort procedures available in SAS and SPSSX.
Two cataloged procedures on the MVS system at USC are utilities specifically designed for sorting data sets: USCSORT and LRGSORT. The size of the data set determines which utility should be executed. USCSORT and LRGSORT Utilities discusses these utilities and includes a formula that uses the number of records and the record length of a data set to determine which sort utility will sort it most efficiently.
In addition to these two sort utilities, many software packages have internal sort procedures. Sorting Using SAS and SPSSX briefly examines procedures available in the SAS and SPSSX packages for sorting data sets. Both the batch sort utilities and software package sort routines on the MVS machine are based on SYNCSORT OS, a high-performance sort utility designed for use on large systems.
Procedures for sorting VM/CMS files interactively are discussed in Sorting CMS Data Sets.
Related Documentation lists sources of related documentation.
This documentation was produced by the Academic Research and Data Center of USC. Questions about its content should be addressed to a consultant at 777-6865.
USCSORT and LRGSORT Utilities
Sort utilities are useful in any situation in which the records in a data set need to be re-ordered. For example, a user may have a tape containing student survey data alphabetically ordered by last name. Sort utilities are an efficient means to create a new data set sequenced by Social Security Number, age, race, or any other variable contained in the data. USCSORT and LRGSORT differ in their space allocations. This section lists the range of space allocations appropriate for each and outlines the job setup to run these utilities.
Determining which Sort Utility is Needed
The utilities USCSORT and LRGSORT differ in their space allocations. To determine the appropriate sort utility for your data set, and the work space requirements, use the following procedure:
1. Compute N from the following formula, where
A = number of records to be sorted
B = average record length in bytes
N = (A x B x 0.4) / 56,664
N is the approximate number of tracks required on each of three work packs to sort your data set.
2. Use the following table to determine the appropriate sort utility based on the value of 'N'.
Value of N           Appropriate Utility
------------------   -------------------
N < 5000             USCSORT
5000 <= N < 13000    LRGSORT
13000 <= N           Consult Academic Research and Data Center
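The two steps above can be sketched in Python as a quick check before a job is submitted. This is only an illustration of the arithmetic: the function names `work_tracks` and `choose_utility` are hypothetical and are not part of either utility.

```python
TRACK_DIVISOR = 56_664  # divisor used in the formula in step 1


def work_tracks(num_records: int, avg_record_length: float) -> float:
    """Return N = (A x B x 0.4) / 56,664, as defined in step 1."""
    return (num_records * avg_record_length * 0.4) / TRACK_DIVISOR


def choose_utility(n: float) -> str:
    """Map N to a utility using the table in step 2."""
    if n < 5000:
        return "USCSORT"
    if n < 13000:
        return "LRGSORT"
    return "Consult Academic Research and Data Center"


# Example: one million 80-byte records.
n = work_tracks(1_000_000, 80)
print(f"N = {n:.1f} -> {choose_utility(n)}")  # N = 564.7 -> USCSORT
```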
The LRGSORT procedure includes a preliminary step that verifies that the requested space is available on the sort volumes. The subsequent LRGSORT SORT step is not executed until the requested space becomes available.
When reviewing job output for either of the cataloged sort procedures, note that the USCSORT procedure generates only one set of condition codes. LRGSORT job output, however, has two sets of condition codes: one for the preliminary SORVCHK step and one for the SORT step, in which the actual sort takes place.
Job Set-up
The job set-up for the two utilities is identical except for the name of the procedure to be executed; substitution of 'LRGSORT' for 'USCSORT' will automatically give the user an appropriate amount of space for sorting. The general format of the JCL necessary to execute these utilities follows:
JOB statement
// EXEC sort,SPACE=tracks
//SORT.SORTIN DD (parameters describing input data set)
//SORT.SORTOUT DD (parameters describing output data set)
//SORT.SYSIN DD *
SORT FIELDS=(parameters)
/*
EXEC statement - this JCL statement is required; the name of the appropriate utility should be
substituted for 'sort'. The SPACE parameter is recommended for LRGSORT in all cases,
and for USCSORT if the value of 'N' computed in step 1 above exceeds 500.
In these cases, simply replace 'tracks' with the value of 'N'. For example,
// EXEC USCSORT or
// EXEC LRGSORT,SPACE=6000
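The rule for when to code the SPACE parameter can be sketched as a small Python helper that produces the EXEC statement text. The helper name `exec_statement` is hypothetical, and rounding 'N' up to a whole number of tracks is an assumption; the text above says only to replace 'tracks' with the value of 'N'.

```python
import math


def exec_statement(utility: str, n: float) -> str:
    """Build the EXEC statement, adding SPACE= where the text recommends it."""
    if utility not in ("USCSORT", "LRGSORT"):
        raise ValueError("utility must be USCSORT or LRGSORT")
    # SPACE is recommended for LRGSORT in all cases, and for USCSORT when N > 500.
    if utility == "LRGSORT" or n > 500:
        tracks = math.ceil(n)  # assumption: round N up to whole tracks
        return f"// EXEC {utility},SPACE={tracks}"
    return f"// EXEC {utility}"


print(exec_statement("USCSORT", 120.4))
print(exec_statement("LRGSORT", 6000))
```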
SORT.SORTIN DD statement - this JCL statement defines the data set to be sorted;
it is required. The SORTIN file must have physical sequential organization
or be a member of a partitioned data set. The parameters for defining the
SORTIN data set are the same as those outlined for input data sets in the ARDC
JCL documentation.
SORT FIELDS=(p,l,f,o,...,p,l,f,o) <,SKIPREC=N>
<,EQUALS>
<,NOEQUALS>
SORT statement parameters are described below:
FIELDS parameter - fields are referenced by position ('p'), length
('l'), data format ('f'), and order value ('o').
Position - defines the relative location of the field,
within a record, on which the sort is to be performed. For example, if
the user wishes to sort a group of survey participants by Social Security Number
(SSN), and SSN begins in the 18th byte of each record, the value of 'p' is '18'.
Length - represents the number of bytes contained in the
field on which the sort is to be performed. Continuing with the example
cited above, 'l' would assume a value of '9' because SSN occupies nine bytes in
each record.
Data format code - data can be assigned 13 different format
codes; the format code chosen is dependent upon the type of data to be sorted.
In our example (above) 'f' would take on a value of 'CH' indicating
character format. Other common data format codes are 'BI' for binary data
and 'PD' for packed decimal data.
Order value - indicates the sequence in which the data
is to be sorted, that is, ascending ('A') or descending ('D') sort order.
The student survey example used above, if sorted in ascending order, would be coded in the following manner:
SORT FIELDS=(18,9,CH,A)
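To make the meaning of 'p' and 'l' concrete, the following Python sketch mimics what SORT FIELDS=(18,9,CH,A) does to a few in-memory records. It is an analogy only: the records are hypothetical, the helper `field_key` is not part of any sort utility, and Python compares characters by code point rather than by the collating sequence used on MVS (for an all-digit field such as SSN the resulting order is the same).

```python
def field_key(record: str, position: int, length: int) -> str:
    """Extract a control field; 'position' is 1-based, as in SORT FIELDS."""
    start = position - 1
    return record[start:start + length]


# Hypothetical records: 17 bytes of name data, then a 9-byte SSN at position 18.
records = [
    "SMITH JOHN".ljust(17) + "555443333",
    "ADAMS MARY".ljust(17) + "111223333",
    "JONES ALAN".ljust(17) + "333221111",
]

# Ascending sort on the field at position 18, length 9.
by_ssn = sorted(records, key=lambda r: field_key(r, 18, 9))
for rec in by_ssn:
    print(rec)
```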
FORMAT parameter - in cases where more than one control field is specified,
but all the data format codes (f) are the same (all character or all packed
decimal for example), the SORT statement may be coded with the format parameter
in the following general form:
SORT FIELDS=(p,l,o,...,p,l,o),FORMAT=f
In the example that follows, the primary sort is on the field beginning in position
'18' for a length of '2', with a secondary sort on the field beginning in position
'7' for a length of '4'. Both fields are sorted in ascending order ('A'), and both
contain data in character format ('CH'). Assume that the first sort field
is 'age' and the second sort field is 'final four digits of SSN'. The output
file will be in ascending order by age and, among records of identical age, in
ascending order by the final four digits of SSN.
SORT FIELDS=(18,2,A,7,4,A),FORMAT=CH
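The relationship between the two forms of the SORT statement can be sketched in Python. The hypothetical helper `sort_statement` takes a list of (p, l, f, o) tuples and emits the FORMAT=f shorthand only when several control fields share one format code; it merely builds text and does not validate the values against any sort utility.

```python
def sort_statement(fields):
    """Build a SORT control statement from (p, l, f, o) tuples.

    The FORMAT=f form is used when more than one field is given and
    all fields share the same data format code.
    """
    if not fields:
        raise ValueError("at least one control field is required")
    formats = {fmt for _, _, fmt, _ in fields}
    if len(fields) > 1 and len(formats) == 1:
        body = ",".join(f"{p},{l},{o}" for p, l, _, o in fields)
        return f"SORT FIELDS=({body}),FORMAT={formats.pop()}"
    body = ",".join(f"{p},{l},{fmt},{o}" for p, l, fmt, o in fields)
    return f"SORT FIELDS=({body})"


print(sort_statement([(18, 9, "CH", "A")]))
print(sort_statement([(18, 2, "CH", "A"), (7, 4, "CH", "A")]))
```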
SKIPREC and EQUALS/NOEQUALS parameters - these parameters are optional.
SKIPREC=n instructs the sort to skip the first 'n' records of the input file.
The skipped records are discarded before sorting and are not included
in the output file. For example:
SORT FIELDS=(18,9,CH,A),SKIPREC=5
This statement skips the first five input records and sorts the remainder.
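The effect of SKIPREC can be pictured with a short Python analogy. The function name `sort_with_skiprec` is hypothetical, and the sketch handles a single ascending character field only.

```python
def sort_with_skiprec(records, position, length, skiprec=0):
    """Drop the first 'skiprec' records, then sort the rest on one field."""
    start = position - 1  # SORT FIELDS positions are 1-based
    remaining = records[skiprec:]  # skipped records never reach the output
    return sorted(remaining, key=lambda r: r[start:start + length])


# Two hypothetical header records followed by three data records.
data = ["HDR1", "HDR2", "C", "A", "B"]
print(sort_with_skiprec(data, 1, 1, skiprec=2))
```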
When specified, EQUALS acts to preserve the original order of records that contain equal control fields. For example, using the EQUALS option on an alphabetical listing of names being sorted by zip code, a user would get output with alphabetical order intact within equal zip codes. The EQUALS option decreases sort efficiency slightly and should therefore be used only when necessary.
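The behavior that EQUALS preserves is what is generally called a stable sort. Python's built-in sorted() is stable, so it can illustrate the zip code example; the names and zip codes below are hypothetical, and the sketch says nothing about how the MVS utilities implement the option.

```python
# Hypothetical (name, zip code) records, already in alphabetical order by name.
people = [
    ("ADAMS", "29208"),
    ("BAKER", "29201"),
    ("CLARK", "29208"),
    ("DAVIS", "29201"),
]

# sorted() is stable: records with equal keys keep their input order,
# so names remain alphabetical within each zip code.
by_zip = sorted(people, key=lambda rec: rec[1])
for name, zip_code in by_zip:
    print(zip_code, name)
```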
Sorting Using SAS and SPSSX
The SAS PROC SORT procedure allocates 4 cylinders (849,960 bytes per cylinder) of sort space. This amount should suffice for most applications. However, if additional space is necessary, contact the Academic Research and Data Center at 777-6865 for assistance. Consult the SAS documentation for sources of more information on SAS and PROC SORT.
SPSSX uses the SORT CASES control statement to sort data. Refer to the SPSSX documentation for assistance and sources of more information on SPSSX.
Sorting CMS Data Sets
CMS data sets can be sorted in the CMS environment or the XEDIT (editor) environment.
The general format of the SORT command for the CMS environment is
SORT fileid1 fileid2
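fileid1 is the identifier (filename filetype filemode) of the CMS file to be sorted; fileid2 is the identifier of the file that receives the sorted output. CMS then prompts for the sort fields with the message
DMSSRT604R ENTER SORT FIELDS:
Respond by entering one or more pairs of numbers in the form 'xx yy', where 'xx' is the first column and 'yy' is the last column of a sort field. For example, the command
SORT OLD JCL A NEW JCL A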
causes CMS to ask for sort fields. In response, the following is entered
10 20 1 5
resulting in 'OLD JCL A' being sorted on the primary field in columns 10-20 and on the
field in columns 1-5 within the primary field; the sorted data set would be stored in
'NEW JCL A'.
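The way a reply such as '10 20 1 5' is interpreted can be sketched in Python. Both function names are hypothetical, the sketch assumes ascending order on every field, and it compares characters by Python's code-point order rather than the order CMS uses.

```python
def parse_sort_fields(response: str):
    """Turn a reply such as '10 20 1 5' into [(10, 20), (1, 5)] column pairs."""
    numbers = [int(tok) for tok in response.split()]
    if not numbers or len(numbers) % 2:
        raise ValueError("sort fields must be entered as pairs of numbers")
    return list(zip(numbers[0::2], numbers[1::2]))


def cms_sort(lines, response: str):
    """Sort lines on 1-based, inclusive column ranges, primary field first."""
    pairs = parse_sort_fields(response)
    return sorted(lines, key=lambda ln: tuple(ln[s - 1:e] for s, e in pairs))


# Hypothetical file contents: columns 1-5 hold a name, columns 10-20 hold a key.
old_file = ["BBBBB    KEY2", "AAAAA    KEY2", "CCCCC    KEY1"]
for line in cms_sort(old_file, "10 20 1 5"):
    print(line)
```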
The general format of the SORT macro, which sorts the file currently being edited (XEDIT environment), is:
SORT n o col1 col2 ...
n - replace with the number of lines to be sorted (starting at the current line).
If 'n' is specified as an asterisk (*), lines are sorted from the current line to
the end of the file.
For example,
SORT * 10 20 1 5
sorts all lines from the current line to the end of the file on the primary sort field in columns
10-20 and the secondary sort field in columns 1-5.
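The difference between the XEDIT macro and the CMS command is that only the lines from the current line onward are re-ordered. A minimal Python sketch of that behavior follows; the name `xedit_sort` is hypothetical, the current line is given as a 0-based index, and the 'o' operand is ignored, so the sketch always sorts in ascending order.

```python
def xedit_sort(lines, current, n, col_pairs):
    """Sort 'n' lines starting at index 'current'; n='*' means to end of file."""
    end = len(lines) if n == "*" else current + n

    def key(line):
        # Column ranges are 1-based and inclusive, as in the SORT macro.
        return tuple(line[s - 1:e] for s, e in col_pairs)

    # Lines before 'current' and after the sorted range are left in place.
    return lines[:current] + sorted(lines[current:end], key=key) + lines[end:]


edited = ["Z", "C", "A", "B", "Y"]
print(xedit_sort(edited, 1, 3, [(1, 1)]))
```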
For more information concerning sorting CMS data sets, consult the CMS User's Guide (SC19-6210) available for purchase from IBM and for reference in the CS Reference Room, Third Floor, Computer Services Building.
Related Documentation
The ARDC JCL documentation provides basic information concerning Job Control Language required for running jobs at USC. This and other documents that cover various topics related to computer use at USC are available at ARDC Documentation and at ARS Handouts.
*SAS is the registered trademark of SAS Institute Inc., Cary, N.C. 27511, U.S.A. SAS/GRAPH and SAS/ETS are trademarks of SAS Institute Inc., Cary, N.C. 27511, U.S.A.