SAS Questions


SAS/Data Warehouse

 

Data Warehouse

 

Abbreviated DW, a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time.

Development of a data warehouse includes development of systems to extract data from operating systems plus installation of a warehouse database system that provides managers flexible access to the data.

The term data warehousing generally refers to the combination of many different databases across an entire enterprise. Contrast with data mart.

Data Base

 

(1) Often abbreviated DB. A collection of information organized in such a way that a computer program can quickly select desired pieces of data. You can think of a database as an electronic filing system.

Traditional databases are organized by fields, records, and files. A field is a single piece of information; a record is one complete set of fields; and a file is a collection of records. For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number.

An alternative concept in database design is known as Hypertext. In a Hypertext database, any object, whether it be a piece of text, a picture, or a film, can be linked to any other object. Hypertext databases are particularly useful for organizing large amounts of disparate information, but they are not designed for numerical analysis.

To access information from a database, you need a database management system (DBMS). This is a collection of programs that enables you to enter, organize, and select data in a database.

(2) Increasingly, the term database is used as shorthand for database management system.

Metadata

 

Data about data. Metadata describes how and when and by whom a particular set of data was collected, and how the data is formatted. Metadata is essential for understanding information stored in data warehouses and has become increasingly important in XML-based Web applications.

 

Data mart

 

A database, or collection of databases, designed to help managers make strategic decisions about their business. Whereas a data warehouse combines databases across an entire enterprise, data marts are usually smaller and focus on a particular subject or department. Some data marts, called dependent data marts, are subsets of larger data warehouses.

 

Data mining

 

A class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior. For example, data mining software can help retail companies find customers with common interests. The term is commonly misused to describe software that presents data in new ways. True data mining software doesn't just change the presentation, but actually discovers previously unknown relationships among the data.

Data mining is popular in the science and mathematical fields, but it is also utilized increasingly by businesses seeking to extract useful information from the large volumes of data they collect.

 

Data scrubbing or Data cleaning

 

Also referred to as data scrubbing, the act of detecting and removing and/or correcting a database’s dirty data (i.e., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly). The goal of data cleansing is not just to clean up the data in a database but also to bring consistency to different sets of data that have been merged from separate databases. Sophisticated software applications are available to clean a database’s data using algorithms, rules and look-up tables, a task that was once done manually and therefore still subject to human error.

 

 

Data Modeling

 

The analysis of data objects and their relationships to other data objects. Data modeling is often the first step in database design and object-oriented programming as the designers first create a conceptual model of how data items relate to each other. Data modeling involves a progression from conceptual model to logical model to physical schema

 

Data Engine

 

The part of a database management system (DBMS) that actually stores and retrieves data. Most DBMS's include an Application Programming Interface (API) that enables you to directly control the engine without going through the DBMS's user interface.

 

 

Dirty Data

 

In reference to databases, data that contain errors. Dirty data can contain such mistakes as spelling or punctuation, incorrect data associated with a field, incomplete or outdated data or even data that is duplicated in the database.

 

 

ODS

 

Short for operational data store, a type of database that serves as an interim area for a data warehouse in order to store time-sensitive operational data that can be accessed quickly and efficiently. In contrast to a data warehouse, which contains large amounts of static data, an ODS contains small amounts of information that is updated through the course of business transactions. An ODS will perform numerous quick and simple queries on small amounts of data, such as acquiring an account balance or finding the status of a customer order, whereas a data warehouse will perform complex queries on large amounts of data. An ODS contains only current operational data while a data warehouse contains both current and historical data

 

 

Schema

 

The structure of a database system, described in a formal language supported by the database management system (DBMS). In a relational database, the schema defines the tables, the fields in each table, and the relationships between fields and tables. Schemas are generally stored in a data dictionary.

 

Data Independency

 

The separation of data from the programs that use the data. Nearly all modern applications are based on the principle of data independence. In fact, the whole concept of a database management system (DBMS) supports the notion of data independence since it represents a system for managing data separately from the programs that use the data. In contrast, it is possible to write applications in which the data being processed is actually represented in the program's source code. This data-dependent approach is very inflexible because it makes it difficult to modify the data and it also makes the data inaccessible to other programs

 

Domain

 

In database terminology, the set of allowable values for a given field or attribute. For example, the domain of a field that stores month numbers is the integers 1 through 12; a DBMS can use the domain to reject values that fall outside it.

 

Data Integrity

 

Refers to the validity of data. Data integrity can be compromised in a number of ways:

·  Human errors when data is entered

·  Errors that occur when data is transmitted from one computer to another

·  Software bugs or viruses

·  Hardware malfunctions, such as disk crashes

·  Natural disasters, such as fires and floods

There are many ways to minimize these threats to data integrity. These include:

·  Backing up data regularly

·  Controlling access to data via security mechanisms

·  Designing user interfaces that prevent the input of invalid data

·  Using error detection and correction software when transmitting data

 

 

Data

 

(1) Distinct pieces of information, usually formatted in a special way. All software is divided into two general categories: data and programs. Programs are collections of instructions for manipulating data.

Data can exist in a variety of forms -- as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind.

Strictly speaking, data is the plural of datum, a single piece of information. In practice, however, people use data as both the singular and plural form of the word.

 (2) The term data is often used to distinguish binary machine-readable information from textual human-readable information. For example, some applications make a distinction between data files (files that contain binary data) and text files (files that contain ASCII data).

(3) In database management systems, data files are the files that store the database information, whereas other files, such as index files and data dictionaries, store administrative information.

 

Basics of Report Writing - What’s New?

TS-610 - "Basics of Report Writing in Version 6" is an overview of writing to external files using the DATA step. The information found in TS-610 still applies in newer releases of SAS, but some new tricks and time-savers have been added as well.

This paper assumes you are familiar with the information covered in TS-610 and presents additions to DATA step report writing through Release 8.2 TSLEVEL 2M0.

Additions to the FILE Statement

DSD

The DSD option on the FILE statement enables you to write variables’ values that contain embedded delimiters using LIST output. DSD is ignored for all other types of output (i.e. formatted, column, and named). When DSD is specified, any variable’s value that contains the specified delimiter is automatically quoted with double quotation marks (") when it is written. A variable's value that does not contain the specified delimiter is not quoted. However, you can use the tilde ("~") format modifier to force any variable's value to be quoted, even if it contains no delimiter.

For example:

DATA ONE;
  INPUT X $ Y $ Z $;
DATALINES;
apple banana figs,red
;
 
DATA _NULL_;
  SET ONE;
  FILE PRINT DSD;
  PUT X ~ Y ~ Z ;
RUN;
 
/* RESULTS */
 
"apple","banana","figs,red"

DLM/ DELIMITER=

Using the DELIMITER= option, you can write data that are delimited by a specified character other than a blank. This option is valid with LIST style output only. Unlike INFILE DELIMITER= processing, only one character is used as the output delimiter.
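
For example, a minimal sketch reusing the data set ONE from the DSD example above:

DATA _NULL_;
  SET ONE;
  FILE PRINT DLM='|';   /* pipe instead of the default blank */
  PUT X Y Z;
RUN;
 
/* RESULTS */
 
apple|banana|figs,red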

FOOTNOTES

Now you can use footnotes in addition to titles. Footnotes, like titles, are global and their text can be defined outside of a DATA step. To see the footnotes in your DATA step report, make sure you specify the FOOTNOTE option on the FILE statement. The default is NOFOOTNOTES.
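
For example, a minimal sketch reusing the data set ONE from above (the option is written here in its plural form, FOOTNOTES):

FOOTNOTE1 'Prepared with a DATA _NULL_ step';
 
DATA _NULL_;
  SET ONE;
  FILE PRINT FOOTNOTES;   /* default is NOFOOTNOTES, which suppresses the footnote */
  PUT X Y Z;
RUN;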

ODS

Adding the ODS option to the FILE statement binds a table definition to the data component to create an output object. This object is sent to all open ODS output destinations. The fileref must be PRINT when ODS is specified on the FILE statement.

ODS is too broad a subject to be covered here. For more information see The Complete Guide to the SAS Output Delivery System, Version 8.
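
Still, a minimal sketch of the basic usage may help (the HTML destination and file name are illustrative):

ODS HTML BODY='report.html';
 
DATA _NULL_;
  SET ONE;
  FILE PRINT ODS;   /* bind the default table definition to the data component */
  PUT _ODS_;        /* write the variables to the output object */
RUN;
 
ODS HTML CLOSE;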

_FILE_

Use the _FILE_= option to assign the contents of the current output buffer to a variable. The variable is automatically retained and initialized to blanks. The maximum length of the variable is the value of the LRECL = option specified on the FILE statement.

The following example uses the variable TEMP1BUF to manipulate the current data in the output buffer.

 
DATA _NULL_;
  /* routing to the output window */
  FILE print _FILE_=temp1buf;
  temp1buf = 'Line one in temp1'; 
  PUT;  
  PUT 'TAG! You are it!' @;
  fred=TEMP1buf;
  SUBSTR(temp1buf,6,7) = "I'm not";
 
  PUT;
 
  /* routing to the log */
  FILE log _FILE_=logbuf;
  Logbuf="***** "||TRIM(fred)||" *****";
  PUT ;
RUN;
 
/* RESULTS  from the OUTPUT window */
 
Line one in temp1
TAG! I'm not it!
 
/* RESULTS from the LOG window */
 
1    DATA _NULL_;
2       FILE print _FILE_=temp1buf;
3       temp1buf = 'Line one in temp1';
4       PUT;
5       PUT 'TAG! You are it!' @;
6       fred=TEMP1buf;
7       SUBSTR(temp1buf,6,7) = "I'm not";
8
9       PUT;
10
11      FILE log _FILE_=logbuf;
12      Logbuf="***** "||TRIM(fred)||" *****";
13      PUT ;
14   RUN;
***** TAG! You are it! *****
NOTE: 2 lines were written to file PRINT.
 

See also "New Automatic Variable" for an example using the automatic variable _FILE_.

New Automatic Variable

Similar to the _FILE_ option on the FILE statement, one can access the output buffer with the automatic variable _FILE_ . This can be used to modify the data in the current output buffer before it is output.

Here is an example where we want to take a SAS date variable and write it out to a flat file to look like "01 January of the year 2001".

Prior to Version 7, you can use TRIM and PUT functions to create this result:

DATA _NULL_;
  INPUT d mmddyy6.;
  FILE PRINT;
  field=TRIM(LEFT(PUT(day(d),z2.)))||' '||
        TRIM(LEFT(PUT(d,monname.)))||' of the year '|| PUT(d,year.);
  PUT  field;
DATALINES;
090401
032001
;
 
/* RESULTS */
 
04 September of the year 2001
20 March of the year 2001

Assigning the value of _FILE_ to a variable allows us to manipulate values (in our example, formatted versions of the values) that are in the output buffer before the buffer is output, thus creating the desired output with fewer function calls. Here is the example again, using _FILE_ and an easy way to uppercase the buffer just before it is output.

DATA _NULL_;
  INPUT d mmddyy6.;
  FILE PRINT;
  FORMAT d monname. dd z2. y year.;
  dd=day(d);
  y=d;
  PUT dd d " of the year " y @;
  field=UPCASE(_FILE_);
  PUT @1 field ;
DATALINES;
090401
032001
;
 
/* RESULTS */ 
 
04 SEPTEMBER  OF THE YEAR 2001
20 MARCH  OF THE YEAR 2001

Alternative to the DATA step

If you are creating a flat file, a comma or tab delimited file for example, and you do not need a lot of control over the output, you may want to try PROC EXPORT.

The syntax is relatively straightforward, and so is the output. There are no options for titles, footnotes, headers, etc.

PROC EXPORT data=dsn
  OUTFILE="c:\myfiles\report.txt"
  DBMS=DLM;
  DELIMITER='09'x;
RUN;

The code above takes the data set WORK.DSN and creates a tab-delimited file called REPORT.TXT in the folder MYFILES, which resides on my C drive. Any single character can be specified as the delimiter. Note that the first semicolon appears after DBMS=DLM, because DELIMITER= is a separate statement. I put the code on 5 lines instead of 3, for ease of reading. (This is standard layout as well.)

REPORT.TXT has variable names from WORK.DSN as the first row of data. The remaining rows have the value of the variables as they were on the PDV, separated by a tab. If a SAS-provided permanent format is associated with a variable, the formatted value appears in the delimited file. (EXPORT does not handle user-defined formats. If the format is user-written, you will have to create a character variable using the PUT function and the user-defined format instead, and pass that variable to PROC EXPORT.)
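
For example, a sketch of that workaround; the STATUS variable and the user-written format STATFMT. are hypothetical:

PROC FORMAT;
  VALUE statfmt 1='Active' 2='Closed';
RUN;
 
DATA work.dsn2;
  SET work.dsn;
  status_c = PUT(status, statfmt.);   /* character copy carrying the user-defined formatted value */
  DROP status;
RUN;
 
PROC EXPORT DATA=work.dsn2
  OUTFILE="c:\myfiles\report.txt"
  DBMS=DLM;
  DELIMITER='09'x;
RUN;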

To learn more about PROC EXPORT, see the SAS Procedures Guide, Version 8.

 

Reading Delimited Text Files Into SAS

Since the release of Version 6, some new options are available that many users are not aware of. This paper is intended to explain some of the new options, as well as demonstrate the use of older options that might be helpful.

When reading text files with the DATA step, two statements are used: the INFILE statement and the INPUT statement.

The INFILE statement is used to specify the physical file being read, as well as any options pertaining to the file. You can use a FILENAME statement in conjunction with an INFILE statement, as such:

FILENAME myfile "C:\sasfiles\testfile.csv";
DATA A;
INFILE MYFILE;

or you can specify the fully qualified path to the file on the INFILE statement, as such:

INFILE "C:\sasfiles\testfile.csv";

The set of options explained below are used on the INFILE statement:

DELIMITER= (DLM=) - This option allows you to tell SAS what character is used as a delimiter in a file. If this option is not specified, SAS assumes the delimiter is a space. Some common delimiters are comma, vertical pipe, semi-colon, and tab. The syntax for the option would be as follows:

DLM='char'

where char is equal to the character used as the delimiter in the text file. Since there is no key available for the tab character, you need to use the hexadecimal representation for tab. On ASCII based platforms, this is defined as '09'x, and the representation is '05'x on EBCDIC platforms. The syntax would be as follows:

DLM='09'x

DSD - It has three functions when reading delimited files. The first function is to strip off any quotes that surround values in the text file. The second function deals with missing values. When SAS encounters consecutive delimiters in a file, the default action is to treat the delimiters as one unit. If a file has consecutive delimiters, it’s usually because there are missing values between them. DSD tells SAS to treat consecutive delimiters separately; therefore, a value that is missing between consecutive delimiters will be read as a missing value when DSD is specified. The third function assumes the delimiter is a comma. If DSD is specified and the delimiter is a comma, the DLM= option is not necessary. If another delimiter is used, the DLM= option must be used as well. This option became available in SAS 6.07 and is documented in "SAS Technical Report P-222". In Version 7 and beyond, DSD is documented in "SAS Language Reference: Dictionary".

TRUNCOVER- By default, if SAS reads past the end of a record and has not satisfied the INPUT statement, it continues to read data from the next record. This action is called FLOWOVER. MISSOVER is an option that overrides the default action of FLOWOVER, and causes a variable to be set to missing if the INPUT statement is unable to read the entire value. TRUNCOVER is similar to MISSOVER, since it also overrides the default action of FLOWOVER. However, instead of setting a variable’s value to missing, TRUNCOVER writes whatever characters it is able to read to the appropriate variable, so you know what the input record contained. This option became available in SAS 6.07 and is documented in "SAS Technical Report P-222". In Version 7 and beyond, TRUNCOVER is documented in "SAS Language Reference: Dictionary".

LRECL= - This option should be used when the length of the records in a file is greater than 256 bytes (on ASCII platforms). The input buffer is 256 bytes by default, and records that are greater than this length will be truncated when they are read. Setting the LRECL= option to a greater length assures that the input buffer is long enough to read the whole record.

FIRSTOBS= - This option indicates that you want to start reading the input file at the record number specified, rather than beginning with the first record. This option is helpful when reading files that contain a header record, since a user can specify FIRSTOBS=2 and skip the header record entirely.

The following is what a typical INFILE statement may look like:

INFILE "C:\sasfiles\testfile.csv" DLM='09'x DSD LRECL=1000 TRUNCOVER FIRSTOBS=2;

The INPUT statement is used to list the variables you want to read from the file.

For ex:

DATA A;
INFILE "C:\sasfiles\testfile.csv";
INPUT VAR1 VAR2 VAR3;
RUN;

INPUTTING DATA:

If your file contains numeric variables or character variables with a length of 8 bytes or less, then you don’t have to worry about using informats to read in the data (although you may still want to use them if you are reading data such as date or time values, etc.). You would just need to denote any character variables with a dollar sign ($).

For ex.

DATA A;
INFILE 'C:\sas\example1.csv';
INPUT FNAME $ LNAME $ AGE;
RUN;

In the example above, the variables FNAME and LNAME are character, so they need a $ after each variable name. The variable AGE is numeric, so nothing else is needed. All three variables will have a length of 8 bytes, since no informats were used.

When reading delimited text files, SAS uses the delimiter to determine where one field ends and another begins. Because of this, list input is required. As in the previous example, list input simply lists the variables. However, if your character variables are longer than 8 bytes or if you have values such as dates, times, packed decimal, etc., you need to use an informat and specify a width. This is called formatted input. Using formatted input to read a delimited file can cause problems, because SAS ignores the delimiter and reads the number of bytes specified by the informat. This problem can be corrected by using the colon (:) format modifier, which specifies modified list input. The format modifier tells SAS to read up to the maximum number of bytes specified in the informat, OR until it reaches a delimiter. The delimiter is not ignored.

For ex.

DATA A;
INFILE 'C:\sas\example2.csv' DLM=',' TRUNCOVER;
INPUT FNAME :$12. LNAME :$20. AGE DATE :MMDDYY10.;
RUN;

To sum it all up, here is an example of a program that reads in a typical comma delimited file.

DATA A;
INFILE 'C:\sas\example3.csv' DLM=',' DSD LRECL=1000 TRUNCOVER;
INPUT FNAME :$12. LNAME :$20. AGE DATE :MMDDYY10.;
RUN;
(Or)
 
FILENAME MYFILE 'C:\sas\example3.csv';
DATA A;
INFILE MYFILE DLM=',' DSD LRECL=1000 TRUNCOVER;
INPUT FNAME :$12. LNAME :$20. AGE DATE :MMDDYY10.;
RUN;

 

How to Create Dummy Variables in SAS

Say you have an age variable with five values: 1, 2, 3, 4 and 5, and you want to make five dummy variables of it.

DATA TEST;
INPUT age @@;
DATALINES;
1 2 3 4 5
;


Method 1

DATA DUMMYMETHOD1;
 SET TEST;
age1=(age=1);
age2=(age=2);
age3=(age=3);
age4=(age=4);
age5=(age=5);

PROC FREQ;
 TABLE age1 age2 age3 age4 age5;
RUN;

Note: The statement age1=(age=1) will create a variable called age1 with a value of 1 if age=1, and 0 if otherwise.


Method 2

DATA DUMMYMETHOD2 (DROP = i);
SET TEST;
ARRAY A {*} age1 age2 age3 age4 age5;
DO i = 1 TO 5;
A(i) = (age=i);
END;

PROC FREQ;
TABLE age1 age2 age3 age4 age5;
RUN;

Note: An alternative representation to the ARRAY statement above is:
ARRAY A {*} age1-age5;
This alternative is useful if you are creating many dummy variables (assuming age has more than five unique values).


Method 3

DATA DUMMYMETHOD3;
     SET TEST;
IF age=1 then age1=1; ELSE age1 = 0;
IF age=2 then age2=1; ELSE age2 = 0;
IF age=3 then age3=1; ELSE age3 = 0;
IF age=4 then age4=1; ELSE age4 = 0;
IF age=5 then age5=1; ELSE age5 = 0;

PROC FREQ;
     TABLE age1 age2 age3 age4 age5;
RUN;

Note: This option is not useful for creating dummy variables out of a variable with many unique data values.

How do I Create a SAS Data Set with Compressed Observations?

To create a compressed SAS data set, use the COMPRESS=YES option as an output DATA set option or in an OPTIONS statement. Compressing a data set reduces its size by reducing repeated consecutive characters or numbers to 2-byte or 3-byte representations. To uncompress observations, you must use a DATA step to copy the data set and use the option COMPRESS=NO for the new data set.

The advantages of using a SAS compressed data set are reduced storage requirements for the data set and fewer input/output operations necessary to read from and write to the data set during processing. The disadvantages include not being able to use the SAS observation number to access an observation. The CPU time required to prepare compressed observations for input/output operations is increased because of the overhead of compressing and expanding the observations. (Note: If there are few repeated characters, a data set can occupy more space in compressed form than in uncompressed form, due to the higher overhead per observation.) For more details on SAS compression see "SAS Language: Reference, Version 6, First Edition, Cary, NC: SAS Institute Inc., 1990".

There are two ways to compress data sets in SAS:

  • Using the option in the DATA step to compress a data set:

data ssd.income (compress=yes);

  • To compress all data sets created within a SAS session:

options compress=yes;
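
To reverse the process (the uncompress step mentioned above), copy the compressed data set into a new one with COMPRESS=NO; the output data set name below is just illustrative:

data ssd.income2 (compress=no);   /* uncompressed copy of the compressed data set */
   set ssd.income;
run;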

 

Converting a Numeric Variable to a Character Variable

This is an example of how to change a numeric variable, ID, to character variable.

This example uses PUT function to convert numeric data to character data. The PUT function writes values with a specified format. It takes two arguments: the name of the numeric variable and a SAS format or user-defined format for writing the data.

char_id = put(id, 7.) ;
drop id ;
rename char_id=id ;

Below are a few examples of data conversions using the PUT function:

[convert numeric variable "oldvar"(length) to character variable "newvar"(length)]

oldvar         put function                       newvar
303    (8)     newvar=put(oldvar, 3.);            303       (3)
32000  (8)     newvar=put(oldvar, dollar7.);      $32,000   (7)
366    (8)     newvar=put(oldvar, date9.);        01jan1961 (9)

 Converting a Character Variable to a Numeric Variable

This is an example of how to change a character variable, AGE, to numeric.

This program reliably changes character variables to numeric variables. The INPUT function takes two arguments: the name of a character variable and a SAS informat or user-defined informat for reading the data.

newage = input(age,3.0);
drop age;
rename newage=age;


Below are a few examples of conversions using INPUT function (character variable "oldvar" to numeric variable "newvar"):
 

oldvar      input function                      newvar
32000       newvar=input(oldvar, 5.);           32000
32000       newvar=input(oldvar, 5.2);          320.00
32,000      newvar=input(oldvar,comma6.);       32000

NOTE: If a character variable only includes numbers, it can be automatically converted to a numeric variable by using it in a numeric context. For example:

newvar = oldvar + 0 ;
newvar = oldvar*1 ;
 

Working with Dates where Years are in Two Digit Format

Q.  Dates in my data have only two digits to represent a year. The digits 00 represent the year 2000, but my SAS program interprets this as 1900. What can I do about this?

A.  Use the SAS system option "YEARCUTOFF=" to specify the first year of a 100 year span within which you want your 2-digit dates to fall.   For example:

options yearcutoff=1950;

would interpret the 2-digit year values 50-99 as 1950 to 1999 and values 00-49 as being 2000 to 2049.
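
For instance, a small sketch showing how the cutoff changes the interpretation of two-digit years (the date values are illustrative):

options yearcutoff=1950;
 
data _null_;
   d1 = input('01/01/49', mmddyy8.);   /* two-digit year 49 falls after the span start, so 2049 */
   d2 = input('01/01/50', mmddyy8.);   /* two-digit year 50 is the first year of the span, so 1950 */
   put d1= date9. d2= date9.;          /* log shows d1=01JAN2049 d2=01JAN1950 */
run;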

For SAS Versions 6.04 -  6.12 the system default value is YEARCUTOFF= 1900.  This means that all two digit years will be assumed to be between 1900 and 1999, unless you reset the default.

For SAS Version 7.0 -- 8.1  the system default is YEARCUTOFF=1920.  This means that all two digit years will be assumed to be between 1920 and 2019, unless you reset the default.

The default value for either version can be easily reset by using the "YEARCUTOFF=" option as shown above.

To find out whether the YEARCUTOFF option has been reset on any version of SAS that you are using you can submit these statements:

proc options option=yearcutoff;
run;

OR -- from SAS 8.1 for Windows just go to Tools --> Options --> System --> Input Control --> Data Processing.  Then scroll down through the list of options until you see Yearcutoff...

How Can I Export the Output From SAS To an Excel File?

Method A: The QUICK EASY WAY (one-way tables only)

IF your output is relatively short and involves only one-way tables:

  1. Save your output in html format:
    • Tools --> Options --> Preferences --> choose the "Results" tab and check the box that says "HTML".
  2. Run your Procedure
  3. Go to the "Results Viewer" to view the html output
  4. Copy and Paste the data cells from the Results Viewer to an Excel table!

Method B: Output your results as SAS data sets and Export to Excel

First, create an output dataset from the results of your frequency distribution. Here are examples of how to output your procedure results into a SAS data set:

  • For a one way table for a single variable ("v1freq" is the output dataset which will contain the frequency distribution for the "v1" variable)
    PROC FREQ DATA = mydata;
    TABLES v1 /OUT = v1freq;
    RUN;
  • For multiple one way tables you can use ODS (Output Delivery System) to create separate output datasets for the frequency distribution of each variable:
    • In the example below "mydatA" is a dataset in my WORK library.
    • The ODS output statement will create one dataset for each of the variables in the tables statement: "freqs" (which will contain the "name" variable) and "freqs1" which will contain the "v1" variable.
      ODS TRACE ON;
      PROC FREQ DATA = mydatA;
      TABLES name v1;
      ODS OUTPUT OneWayFreqs(match_all)=freqs;
      RUN;
      ODS TRACE OFF;

Second, export the SAS data set to Excel:

  • Assuming the output SAS data set is stored in the mydat library:
    • Option 1:
      • From the SAS menu, click
        1. File --> Export Data --> library="mydat", member=dataset name -->
        2. Check "Standard Data Source" and choose your version of Excel from the picklist provided, --> Next --> Enter the location where you wish to save the Excel file and name the file. -->
        3. Choose whether you want to save a Proc Export file as described in the next window (optional) --> Finish.
    • Option 2:
      • From the mydat library in the Explorer Window, right-click on the output SAS data set file, then choose View in Excel.

Method C: Output your results using ODS HTML BODY= statement

The ODS HTML BODY= option allows the user to save the results as an html file. Usually, the filename extension assigned to the file is .HTML. However, the user can opt to assign a .XLS extension so it can be read in Excel (although .HTML files can be read directly into Excel as well).

ODS HTML BODY = 'C:\thesis\v1.xls';
PROC FREQ DATA=mydata;
TABLES name v1;
ODS OUTPUT OneWayFreqs(match_all)=freqs;
RUN;
ODS HTML CLOSE;

 Reading Multiple Raw Data Files

Q.  Can I read multiple raw data files in a DATA step to create a combined SAS data set? All of my files have the same data layout.

A.  If a number of raw data files have the same data layout, all of those files can be included in a FILENAME statement and assigned a single fileref. When this fileref is included in a Data step, SAS can read those input raw data files sequentially.

For example, some of the census data have a separate raw data file for each state, all state files have a similar data layout. To create a combined SAS data set for US, the raw data files can be included in a FILENAME statement to be read in a Data step. Here is an example.

FILENAME myfiles ("c:\mydir\one.dat", "c:\mydir\two.dat", "c:\mydir\three.dat");
DATA  temp1;
   INFILE myfiles lrecl=331;
   INPUT var1 10-15 var2 22-30 var3 98-100;
RUN;

With SAS 8.2 you can also use wildcards in your FILENAME statement like this:

filename myfiles ("c:\mydir\*.csv");
filename myfiles ("c:\mydir\test*.dat");
filename myfiles ("c:\mydir\*.*");

Reading Variable Length Files with SAS

Q.  My raw data are not in a rectangular file. Some lines are longer than others. Can I still use SAS to read my data?

A.  To read a raw data file with variable length records, use the INFILE options MISSOVER and PAD along with LRECL (lrecl is the maximum record length). Below is an example.

data temp;
   infile 'a:\varlen.txt' lrecl=10 missover pad;
   input var1 1-2  var2 4-10;
run;

/* Use the following DATA step to determine the maximum record length for the input file. In INFILE statement put an arbitrary high number for LRECL. When you submit these statements SAS log will include the true maximum record length for the file */

data _null_;
   infile 'a:\varlen.txt' lrecl=500000 ;
   input;
run;  

How to Read Space-Delimited ASCII Files Into SAS

To read a space-delimited file you can either use the Import Wizard or use the DATA-INFILE-INPUT statement combination. The Import Wizard is only useful when dealing with text or ASCII data files that have just a few variables; if you are dealing with hundreds, if not thousands, of variables, then the DATA-INFILE-INPUT combination is the better choice.

Reading a Space-Delimited file (POVERTY-LIST.DAT)

Simple list input is usually used to read a space-delimited file: you simply list the variables and their properties in the INPUT statement in the order in which the data values appear in the data being read. To read the data file called poverty-list.dat:

DATA mylib.inputlist;
 INFILE  'c:\sasworkshop\poverty-list.dat';

 INPUT
    state $ medinchh medincfam percapinc pctuspovyt pctfampov;
RUN;  

 

Data Requirements for Simple List Input

  • No field may be skipped. All variables in the external file must be included.
  • Data values must be separated by at least one blank (space delimited).
  • No blanks may be included within a data value.
  • All values must be in every record in the same order.
  • Missing values must be represented by a placeholder such as a single period (.).
  • The DATA statement is where you specify the name of SAS data file to be created and where it will be stored.
  • The INFILE statement indicates the name and location of the external (raw data) file to be read. It may include information about logical record length ("LRECL") of the raw data file. It may include, or be followed by, instructions to read selected observations.
  • The INPUT statement is where you define the properties of your variables. The variables should be listed in the order in which their corresponding data values appear in the external raw data file.
  • May include information about where data values for each variable are located in the raw data file
  • Includes information about characteristics of the variables (numeric versus character values, implied decimals, etc).
  • The names of variables that contain character values are followed by dollar signs ($).
  • Variables listed without the dollar sign are assumed to be numeric.
  • Variable names must begin with a letter or underscore. The rest can be any letter, number or underscore. Variable names can be up to 32 characters long.
  • The RUN statement tells SAS that this is the step boundary and it should begin processing the DATA step.

 Reading Comma-Delimited Raw Data Files

Q.  My raw data came from a spreadsheet and when I saved it as text, it put commas between every column.  Can I read the file into SAS?

A. Yes!  There is an INFILE option called DSD that will make reading this file quite simple.  The DSD option changes the default delimiter to a comma. When you specify DSD, SAS treats two consecutive delimiters as a missing value and removes quotation marks from character values.

Example to read a comma delimited file:

filename in 'c:\my proj\rawdata.txt';     /* this is your raw data file */
 
data new;
   infile in dsd;
   input var1-var20;
run;

Reading Colon or Tab Delimited Raw Data Files

Q.  My raw data file has colons as delimiters. Can I still read it into SAS? What about tab-delimited files?

A.  Yes, just about any delimiter is possible. In SAS, the default delimiter for an input raw data file is space. To specify your own delimiter you can use an INFILE option called DELIMITER= .  If needed, you can also use DSD option. When you specify DSD, SAS treats two consecutive delimiters as a missing value and removes quotation marks from character values. Here are some examples of reading a colon delimited and a tab delimited file.

Example to read a colon delimited file:

filename in 'c:\my folder\rawdata.txt';   /* this is your raw data file */
 
data new;
   infile in delimiter=":";               /* use this for colon delimiters */
   input var1-var20;
run;

Example to read a tab delimited file:

filename two 'c:\my folder\data2.txt';    /* this is your raw data file */
 
data new2;
   infile two dsd delimiter='09'x;        /* use hexadecimal code for tab delimiters */
   input var1-var10;
run;
 

Sorting Very Large Datasets with SAS

Q.  I'm trying to sort a very large SAS dataset (4.49G) and I'm getting the message that SAS is "out of resources".  What can I do?

A.  Here are some helpful tips passed along to us from other users as well as from SAS:

From an experienced user:

  • You can better diagnose where you're running into resource problems by setting a couple of system options: MSGLEVEL=I and FULLSTIMER. You need to determine whether you're running out of RAM or disk space; it's often the latter. (A small sketch of these settings appears after this list.)
  • With regard to memory -- if, for example, you have 750M of RAM, try setting the MEMSIZE option to 700M and the SORTSIZE option to 650M. You need to leave enough room for the operating system and for SAS overhead. *** Make sure you're not running any other processes. ***
  • About disk space -- a common cause of problems! If you are sorting a temporary (WORK) data set, you need to have room for ***4 copies*** of the data set in your WORK library on Windows (5 if you are using UNIX). If you're sorting a permanent data set (two-level name), you need room for 1 copy in the source library, 1 in the destination, and 2 in WORK.
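
A minimal sketch of those diagnostic settings (SORTSIZE can be changed mid-session with an OPTIONS statement; MEMSIZE generally must be set when SAS is invoked, for example sas -memsize 700M, so the values here are illustrative):

options msglevel=i fullstimer;   /* report index usage and full resource statistics in the log */
options sortsize=650M;           /* cap sort memory below what is actually available to SAS */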

Some more tricks -

  • Make your data set smaller: eliminate all unnecessary variables and set the LENGTH of variables to be no more than necessary (e.g., 3 for dummy variables, 4 for integers, etc.). These steps reduce the size of the file to be sorted dramatically and should always be done (if you haven't already). Since you only have 1.7 million records but the file is 4.5 GB, I'm guessing you might not have done this yet (or else the file is inherently very wide -- see below). With data files of this size you need to think hard about how to reduce file size.
  • Try sorting subsets of the data & recombining them --
    • If the file is very 'wide', split it into multiple files which all contain the sort variables but only contain some of the other variables; you can then do a MERGE/BY to recombine them.
    • If the files are very 'long', try subsetting the file (1/2 the obs in one file, 1/2 in another, say), sort them separately, and then interleave them in a data step.
  • (the obvious) -- Try and avoid the sort altogether: index the file or re-think the job sequence.

Important tip: 
I can't stress enough how much reducing the 'width' of your file by dropping unnecessary variables and setting LENGTHs properly can help.  Some combination of the other methods should get you 'sorted out'.

From SAS:

Is the drive formatted as FAT32 or NTFS? A FAT32 drive would have a file size limit of 2.1 GB, and this could be causing this error.

Using the WHERE Statement to Subset Data

Use a WHERE statement to select observations that meet a particular condition from a SAS data set. The WHERE statement subsets the input data by specifying certain conditions that each observation must meet before it is available for processing. The condition that you define in a WHERE statement is an arithmetic or logical expression that generally consists of a sequence of operands and operators. To compare character values, you must enclose them in single or double quotation marks, and the values must match exactly, including capitalization. Using the WHERE statement might improve the efficiency of your SAS programs because SAS is not required to read all the observations in the input data set. Note: You can use only one WHERE statement in a DATA or a PROC step. Examples are given below.

  /* Selecting observations in a Data Step */

libname ssd '[drive]:\[your_folder]\' ;
data subset;
    set ssd.all ;
    where age <= 65 and  upcase(gender) = "M" ;
run;

/* Selecting observations for processing in a Proc Step */

proc means data=ssd.all;
    where age <= 65 and  upcase(gender) = "M" ;
run;

Taking Random Samples of Observations from a SAS Data Set

Q.  I am looking for a program which will let me take a random sample from a very large data set (for example, a sample of 300 observations from a data set of 10,000).

A.  One way of selecting a random sample from a data set is to first use a DATA step to generate a random vector, then use PROC SORT to rearrange the data by that random vector, and then select the first k observations. Below is a sample program.

DATA dummy;                  /* CREATE A DATA SET */
  input var1 @@ ;
  cards;
2.1  3.1 4 6  2.2  4.9 4 5 3  3.3  4 5 3 4.3
2.3 4 5 7 3 3 9 11 2
;
run;

%let k=10;                                                /* DEFINE SAMPLE SIZE */

DATA dummy ;
  SET dummy ;
  random=RANUNI(-1);
             /* GENERATE A RANDOM VECTOR */
run;

PROC SORT DATA=dummy;
 
BY random;       /* SORT OBSERVATIONS BY THE RANDOM VECTOR */
run;

DATA sample;
  SET dummy(drop=random);
  IF _N_ le &k;    
                   /* SELECT THE FIRST K OBSERVATIONS */

run;

proc print;
run;

 

Q: How can I select a specific number of observations from the top of each BY group?

A: Sort the data by the BY variable and, within it, by the value of interest in descending order; then use FIRST.variable processing with a counter to output the first few observations of each BY group, as in the example below.

 
data groups;
  input group $ amount date date9.;
  format date date9. amount dollar6.;
datalines;
A 1000 06Mar2000
A 550 01Mar2000
A 375 15Mar2000
A 1500 01Jun2000
A 900 15Jul2000
A 800 30Jun2000
B 500 01Mar2000
B 400 15Mar2000
B 1050 01Jun2000
B 330 15Jul2000
B 575 30Jun2000
;
run;
 
proc sort;
  by group descending amount;
run;
 
data top3(drop=count);
  set groups;
  by group descending amount;
  if first.group then count=0;
  count+1;
  if count le 3 then output;
run;
 
/* RESULTS  */
 
   
OBS       GROUP           AMOUNT             DATE
 
 1           A             $1,500          01JUN2000
 2           A             $1,000          06MAR2000
 3           A               $900          15JUL2000
 4           B             $1,050          01JUN2000
 5           B               $575          30JUN2000
 6           B               $500          01MAR2000
 
 

Q: Why do I sometimes get different results when I do a many-to-one merge in one DATA step followed by a second DATA step with subsetting conditions versus doing the merge and subset in one DATA step?

A: When you do a many-to-one merge, the variable values in the non-duplicate data set are retained and carried down the BY-Group. The values are not reinitialized to missing until the BY-Group changes. If you are doing a merge and then making changes to one of the variable values which originated in the non-duplicate data set, that change will carry down the BY-Group. This is why you will see different results from doing the merge and variable manipulation in one DATA step versus doing the merge, then doing variable manipulation in a second DATA step.

data one;
  input A1 $ B $ C D1;
datalines;
1B 3209 83  2
1A 3209 77  1
1B 3209 77  1
1B 3209 63  1
;
run;
 
data two;
  input A $ B $ D;
datalines;
1B 3209 1
;
run;
 
/*
   Do the merge and subset in two datasteps
 
   The dataset MATCH1 is created via a many-to-one merge.
   The value of variable A carries down the BY-Group, as is
   the nature of merge.
*/
 
data match1;
  merge one(in=in1) two(in=in2);
  by B;
  if in1 and in2;
run;
 
/*
  Subsetting with a second step
 
  Since you are setting the
  dataset, any changes that are made to a variable's
  value apply only to that observation.
*/
 
data match2 ;
  set match1;
  if 2*D-D1 in (0,1);
  if substr(A1,2,1)='A' and substr(A,2,1)='B' then do;
    E='Y';
    substr(A,2,1)='A';
  end;
  else E='N';
  drop A1 B ;
run;
 
/* RESULTS
 
OBS  C   D1  A   D   E
1    83  2   1B  1   N
2    77  1   1A  1   Y
3    77  1   1B  1   N
4    63  1   1B  1   N
 
*/
 
/*
Try it all in 1 step. Here, the value of variable A is
changed in the 2nd obs from 1B to 1A. Then, because
you are using BY-Group processing in the merge, the
value is retained for the remainder of the BY-Group
at 1A.
*/
 
data match3 ;
  merge one(in=in1) two(in=in2);
  by B;
  if in1 and in2;
  if 2*D-D1 in (0,1);
  if substr(A1,2,1)='A' and substr(A,2,1)='B' then do;
    E='Y';
    substr(A,2,1)='A';
  end;
  else E='N';
  drop A1 B ;
run;
 
/* RESULTS
 
OBS  C   D1  A   D   E
1    83   2  1B  1   N
2    77   1  1A  1   Y
3    77   1  1A  1   N
4    63   1  1A  1   N
 
*/
 
 

Q: I am using the MERGE statement to combine two data sets that might have an unequal number of members in each BY group. After the data sets are combined, I want to reset the variables (in the data set) that have the fewest members in the BY group to missing and continue this process until the BY group changes. Can I do this?

A: The MERGE statement generally carries values of unique variables across the BY group. If you do not want this result, you must add additional coding logic.

In a DATA step, the IN= variables automatically reset when the BY group changes. If you reset these IN= variables prior to the MERGE statement, you can conditionally verify whether an observation is contributed from a BY group with each iteration of the DATA step. If there are remaining members of a BY group in one data set, you can set the variables (excluding the BY variables) to missing.

   Starting Data:          --->   Desired Results:
 
   DATA A        DATA B           DATA NEW
 
  ID   X       ID    Y               ID    X    Y
   1  11       1   111                1   11  111
   1  22       1   222                1   22  222
   1  33                              1   33    .
   1  44                              1   44    .
   1  55                              1   55    .
 
 
 
 
DATA  a;
INPUT id x;
DATALINES;
1 11
1 22
1 33
1 44
1 55
;
RUN;
 
DATA b;
INPUT id y;
DATALINES;
1 111
1 222
;
RUN;
 
 
/* Reset IN= variables to 'false', or zero.  If an
   observation contributes from A and not B, then
   reset Y to missing.                            */
 
DATA new;
ina=0;
inb=0;
MERGE a(in=ina) b(in=inb);
BY id;
IF ina and not inb then y=.;
RUN;
 
 

Q: When I use the MERGE statement to combine two data sets, how can I generate output for only those matches that are found in both data sets?

A: To generate output for only those matches that are found in both data sets, use the IN= option in your MERGE statement. (For details, see "SAS Data Set Options" in SAS Language Reference: Dictionary.) The IN= variable is a Boolean flag that resets when a BY group changes. If you want the MERGE statement to generate output for matches only, test for BY groups that are found in both data sets.

DATA file1;
INPUT var name $;
DATALINES;
100 john
200 joe
400 bill
600 bob
;
RUN;
 
PROC SORT data=file1;
BY var;
RUN;
 
DATA file2;
INFILE cards dsd truncover;
INPUT var address $ 13.;
DATALINES;
100,34 Smith Road
200,67 Burt Ave
300,12 You St
400,45 Younge St
500,79 Wellington
600,23 Done Road
;
RUN;
 
PROC SORT data= file2;
BY var;
RUN;
 
DATA three;
MERGE file1 (in=a) file2 (in=b);
BY var;
IF a and b;
RUN;
 
PROC PRINT;
RUN;
 
/* Results */
 
Obs    var    name       address
 
 1     100    john    34 Smith Road
 2     200    joe     67 Burt Ave
 3     400    bill    45 Younge St
 4     600    bob     23 Done Road
 

Q: What does the following message mean, and how does it affect my results? NOTE: MERGE statement has more than one data set with repeats of BY values.

A: This message means that SAS has found duplicate values in a BY group for more than one of your data sets. While this might not cause a problem, SAS will still notify you because duplicates in both data sets can produce unexpected results. Here are two data sets that have duplicates of the BY variable:

/* Note that there are two A's and three C's in this sample. */
 
DATA one;
INPUT id $ fruit $;
CARDS;
a apple
a apple
b banana
c coconut
c coconut
c coconut
;
RUN;
 
PROC SORT data=one;
BY id;
RUN;
 
/* Note that there are two B's and two C's in this sample. */
 
DATA two;
INPUT id $ color $;
CARDS;
a amber
b brown
b black
c cocoa
c cream
;
RUN;
 
PROC SORT data=two;
BY id;
RUN;
 
/* There will be two observations of the     */
/* B BY group due to duplicates in DATA two. Note the different       */
/* values of COLOR for the C BY group. These are the correct results;    */
/* however, if you did not know that you had duplicates of C in both */
/* data sets, you might not expect these results.       */
 
 
DATA test;
MERGE one two;
BY id;
PROC PRINT;
RUN;
 
RESULTS:
 
Obs id fruit   color
1   a  apple   amber
2   a  apple   amber
3   b  banana  brown
4   b  banana  black
5   c  coconut cocoa
6   c  coconut cream
7   c  coconut cream
 
In this example, SAS notifies you of the presence of duplicates so that you will
not expect COCOA to be assigned across the C BY group.

 

 

Q: Why doesn't the macro variable that I created using CALL SYMPUT resolve?

A:

One of the most common problems when you use a macro variable that is created with SYMPUT is trying to reference it before it is actually created. When you create a macro variable using CALL SYMPUT, one of the most important things to know is when the variable is created. CALL SYMPUT is a DATA step statement, so it does not create the macro variable until DATA step execution time. This is important because macro variable references resolve during the compilation of the DATA step. As a result, the following rules are true:
 
  • You cannot use a macro variable reference to retrieve the value of a macro variable in the same program (or step) in which SYMPUT creates that macro variable and assigns it a value.
  • You must explicitly use a step boundary statement to force the DATA step to execute before referencing the macro variable that is created with SYMPUT. The boundary could be a RUN statement or another DATA or PROC statement.
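 
For instance, a minimal sketch of the timing (the macro variable name is illustrative):
 
data _null_;
  call symput('cnt', '100');   /* executes at DATA step execution time */
run;                           /* step boundary: the DATA step runs and &cnt now exists */
 
%put The value of cnt is &cnt; /* resolves to 100 only because it comes after the step boundary */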
 
 

Q: In SAS 8 or later, are formatted values still limited to 200 characters?

A: No, formatted values are not limited to 200 characters. Beginning with SAS 7, formatted values can consist of a maximum of 32,767 characters.

Q: How can I add the OTHER range to format when creating a format with the CNTLIN= option?

A: In order to add the OTHER range, the variable HLO must be set to 'O'.

The variable HLO's value denotes a special range in an input control data set.
The 'O' (an uppercase o) denotes the special range of OTHER.
Here is a simple example:
 
DATA ONE;
  INPUT X Y $;
  CARDS;
1 BLUE
2 RED
;
 
/* This example uses the END= option to determine when the end of the input data
   set has been reached.  When the last observation has been read, the IF-THEN DO
   block writes out one more observation, setting the value of HLO to 'O' and
   assigning the LABEL for the values of the OTHER range. */
 
DATA TWO;
  LENGTH LABEL $ 20;
  SET ONE END=LAST;
  FMTNAME='COLORS';
  START=X;
  LABEL=Y;
  OUTPUT;
  IF LAST THEN DO;
    HLO='O';
    LABEL='UNKNOWN COLOR';
    OUTPUT;
  END;
RUN;
 
PROC PRINT;RUN;
PROC FORMAT CNTLIN=TWO FMTLIB;
RUN;
 
 

 Q: How can I read in a range of data from an Excel spreadsheet instead of reading the entire spreadsheet?

A: Use the RANGE= option to specify a subset of cells of a worksheet in an Excel workbook when reading the data into a SAS data set with PROC IMPORT:

PROC IMPORT DATAFILE="c:\test\test.xls"
   OUT=work.test DBMS=excel2000 replace;
   RANGE="Sheet1$C6:K60";
   GETNAMES=NO;
 RUN; 

Q: How can I convert a numeric date variable to a character variable using PROC SQL?

A: You can use the PUT function within PROC SQL as shown below.

DATA numdate;
date=TODAY();
RUN;
 
PROC SQL;
CREATE table new as SELECT PUT(date,date9.) as newdate
FROM numdate;
QUIT; 

 

Q: How can I convert a character date variable to a numeric date variable with PROC SQL?

A: You can use the INPUT function in PROC SQL as shown below:

DATA chardate;
date='08/30/2006';
run;
PROC SQL;
create table new as select INPUT(date,mmddyy10.)
as newdate format=mmddyy10.
from chardate;
quit;

Q: How can I retrieve the program name that is currently running in batch or interactively?

A: If you are running in batch, then you can issue the following statement to retrieve the name of the current program:

  %put The current program is %sysfunc(getoption(sysin));

If you are running interactively, then the following code retrieves the path and the name of the current program:

/* NOTE: If more than one editor window is open the STOP statement will need to
         be commented out from the macro.  SASHELP.VEXTFL creates a unique
         fileref for each editor window that is open.  Removing the STOP allows
         us to retrieve the LAST file opened.  */
%macro pname;
%global pgmname;
%let pgmname=;
 
data _null_;
set sashelp.vextfl;
if (substr(fileref,1,3)='_LN' or substr(fileref,1,3)='#LN' or
    substr(fileref,1,3)='SYS') and
   index(upcase(xpath),'.SAS')>0 then do;
call symput("pgmname",trim(xpath));
stop;
end;
run;
%mend pname;
 
%pname;
 
%put pgmname=&pgmname;
 

Beginning with SAS®9 the above macro is not needed. There is a new environment variable for the Enhanced Editor named SAS_EXECFILENAME, and you can retrieve the current program by issuing this statement:

%put %sysget(SAS_EXECFILENAME);

There is also an environment variable for the Enhanced Editor named SAS_EXECFILEPATH that contains the full path of the submitted program or catalog entry. The full path includes the folder and the filename.

Q: Does PROC APPEND process all the observations in the BASE= and DATA= data sets?

A: No, PROC APPEND performs an update in place on the BASE= data set; therefore, it just adds the observations from the DATA= data set to the end of the BASE= data set. The observations in the BASE= data set are not read or processed. This means that PROC APPEND is more efficient than using a DATA step with a SET statement to append two SAS data sets.
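
For example, a minimal sketch (both data set names are illustrative):

proc append base=ssd.master data=work.newrecs;
run;

Only the observations in WORK.NEWRECS are read; they are written onto the end of SSD.MASTER in place.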

Q: How do I use the CNTLOUT= and CNTLIN= options in PROC FORMAT to move formats from one platform to another?

A: The most robust method available for migrating user-defined format libraries is PROC FORMAT with the CNTLOUT= and CNTLIN= options used in conjunction with the XPORT engine. With this method, all the information needed to re-create a format library is written out to a SAS data set, called a "control data set," in XPORT transport format. This file can then be transferred to the receiving site where the process is reversed by using PROC FORMAT with the CNTLIN= option.

The advantages of this method are as follows:

  • This method can be used to move across ALL releases of the SAS System, including migrating back to releases beginning with 6.03.
  • The sending and receiving systems do not have to use the same character set. The formats are created anew using the information stored in the control data set; therefore, later uses of the formats will be done using a binary search.

Disadvantages for this method are limited:

  • Migrating from Release 6.08 of the SAS System to either Release 6.10 or 6.11 under Windows would require running a 6.08 job to create the CNTLOUT= control data set.

The following code shows the process used to create an output control data set from PROC FORMAT; cntlfmt can be any valid SAS data set name:

  libname library 'location-of-existing-formats-catalog';
  libname trans xport 'transport-file-name';
  proc format library=library cntlout=trans.cntlfmt;

Note: As with any transport file in SAS, the file must be created with the appropriate file attributes, when applicable, and the file must be moved using a BINARY or IMAGE transfer when being moved from site to site. For more details on this please refer to SAS Technical Report P-195, Transporting SAS Files between Host Systems.

After moving the XPORT transport file to the receiving site, the following code is used to regenerate the format library.

  libname library 'output-library-for-format-storage';
  libname trans xport 'transport-file-name-from-sending-site';
  proc format library=library cntlin=trans.cntlfmt;

where cntlfmt is the same name that was used on the CNTLOUT= option at data set creation time.

Q: When I use CALL SYMPUT to create a macro variable containing a number and try to reference that macro variable in subsequent code, I sometimes get error messages. Why?

A: The CALL SYMPUT routine performs an automatic numeric-to-character conversion when it writes the value of the macro variable. This process returns the number right justified in an 8-byte string. The macro facility retains any leading blanks when it stores the macro variable. Therefore, if you later attempt to concatenate the macro variable value to a data set name or a variable name, you will encounter syntax errors.

For example, suppose you need to produce a cumulative sum for every numeric variable in your data set. To make this dynamic, you want to avoid hardcoding the elements of this array, and the number of elements can vary. You can accomplish this by first determining the number of numeric variables in your data set and storing this as a macro variable:

  data _null_;
   length vname $32;            /* allow for long variable names */
   set mydata;
   ARRAY NUMS {*} _NUMERIC_;
   NUMVARS=DIM(NUMS);
   CALL SYMPUT('NUMVARS',NUMVARS);
   /* These statements store the names of the first and last numeric
      variables in macro variables for use in the next DATA step. */
   call vname(nums{1},vname);
   call symput('firstvar',vname);
   call vname(nums{dim(nums)},vname);
   call symput('lastvar',vname);
  run;
 

Now, you can use this in an ARRAY statement in a subsequent DATA step to dynamically generate the new variables:

  data cumulsum;
   RETAIN CSUM1-CSUM&NUMVARS;
   set mydata;
   array nums {*} &firstvar--&lastvar;
   ARRAY SUMS {*} CSUM1-CSUM&NUMVARS;
   /* DO OVER applies only to implicitly subscripted arrays,
      so an indexed DO loop is used with these explicit arrays */
   do i=1 to dim(nums);
     sums{i}=sum(sums{i},nums{i});   /* SUM() keeps missing values from
                                        wiping out the running totals */
   end;
   drop i;
  run;

This program creates the macro variable &NUMVARS, which consists of a number with leading blanks. When you try to append the number to the prefix CSUM in the RETAIN and ARRAY statements, the DATA step produces the error

ERROR: Missing numeric suffix on a numbered variable list (CSUM1-CSUM).
 

To correct this problem, use the LEFT function in CALL SYMPUT to left-justify the numeric value and eliminate the leading blanks. It can also be helpful to add the TRIM function, in case trailing blanks would cause problems. The corrected CALL SYMPUT statement is

  CALL SYMPUT('NUMVARS',TRIM(LEFT(NUMVARS)));
 

Q: How does SAS load an array with more than two dimensions?

A: SAS loads an array with more than two dimensions across the first row, then across the second row, then across the third row, and so on; in other words, the rightmost subscript varies fastest. When the first matrix (the slice selected by the first subscript) is filled, SAS fills the array's second matrix, again loading across its first row, and continues in this manner until the array is full.
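
A small sketch of this fill order (the array and variable names here are only for illustration): the initial values 1 through 12 below are assigned with the rightmost subscript varying fastest, so the log shows 3, 4, and 7.

  data _null_;
   array x{2,2,3} x1-x12 (1 2 3 4 5 6 7 8 9 10 11 12);
   put x{1,1,3} x{1,2,1} x{2,1,1};   /* 3 4 7 */
  run;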

 

Q: How can I estimate the amount of memory needed to load a format?

A:

The following can be used to estimate the amount of memory needed to load a
format.
 
(For the purpose of this discussion, 'formats' refers to formats and
informats)
 
Formats and memory
------------------
 
Each user-defined format consists of a header structure, as many range
structures as needed, and as many label structures as needed.
The header structure requires around 100 bytes. Range structures can be "single
value" or "start-end value" structures, and can be character or numeric.
Character ranges can be of varying length; however, if the labels are all under
16 bytes, a special fixed range size of 16 bytes may be used to improve
performance at run time.
 
Their approximate length per range:
 
numeric single value    16
numeric start-end       40
character single value  9+value length, aligned to 8 byte boundary for each
character start-end     16+value lengths
fixed length single     24
fixed length start-end  40
 
Note that for numeric ranges, if FUZZ= is not specified, the default
is 1e-12, and this requires a start-end range even for a single
value specification. You can reduce overall memory usage by specifying
FUZZ=0.
 
Different label structures exist for standard labels, pictures,
numeric informats, character informats, and labels-as-formats.
 
Their approximate length per label:
 
standard label           4+label length, aligned to 8 byte boundary for each
picture label            18+label length, aligned to 8 byte boundary for each
numeric informat         16
character informat       4+label length, aligned to 8 byte boundary for each
format specification     16
 
OTHER clauses count as a label but not as a range.
 
Examples
--------
 
value xxx (fuzz=0) 1='yes' 2='no';
 
header (100) + 2 numeric single values (2*16=32) + 2 standard labels
(4 + length('yes') + 4 + length('no') = 13, aligned to 16), for a total
size of around 148 bytes.
 
value xxx 1='yes' 2='no'; (no fuzz specification)
 
header (100) + 2 numeric start-end values (2*40=80) + 2 standard labels
(4 + length('yes') + 4 + length('no') = 13, aligned to 16), for a total
size of around 196 bytes.
 
value $xxx 'abc' = 'xyz';
 
header (100) + 1 character single value range (9+length('abc')=12) +
1 standard label (4+length('xyz')=7, aligned to 8), for a total size of
around 120 bytes.
 
Consider a format with 20,000 numeric ranges with single values, with
an average label size of 5. We'd use a numeric range size of 16, and each
label would take 4+5=9 bytes, which rounds up to 16 on an 8-byte boundary.
 
header (100) + 20000*16 + 20000*16 = 640,100
 
This would mean that the format would require around 640K of memory.
 
If the same format instead had 200,000 ranges, the memory size would
go to around 6.4MB.
 

Q: How can I permanently store and use formats that I have created?

A:

(For the purpose of this discussion, formats and informats are referred to as
formats)
 
To permanently store a SAS format so that it can be used in subsequent SAS
sessions, use the LIB= option on the PROC FORMAT statement. This option
specifies the libref and, optionally, the name of the catalog that will contain
the formats. If the catalog name is omitted, the format catalog is named
FORMATS by default.
 
  LIBNAME libref 'path-to-the-format-library';
  PROC FORMAT LIB=libref<.catalog-name>;
 
Once the formats are permanently stored in a catalog, SAS needs to know where
to find them in order to use them. This can be done in one of two ways.
 
1)  By default, SAS always looks for a format catalog in the library referenced
by the LIBRARY libref. Use this method when there is only one format library
and/or the format catalog is named FORMATS.
2)  If you have multiple format libraries, format catalogs named something other
than FORMATS, or if you wish to specify the order in which format catalogs are
searched, use the FMTSEARCH= system option.
 
  OPTIONS FMTSEARCH=(catalog-specification-1... catalog-specification-n);
 
This option causes format catalogs to be searched in the order listed until the
desired entry is found. Each catalog specification can be either libref or
libref.catalog. If only the libref is given, SAS assumes that FORMATS is the
catalog name.
The WORK.FORMATS catalog is always searched first, and the
LIBRARY.FORMATS catalog is searched next, unless one of them appears in the
FMTSEARCH= list.  If a catalog appears in the FMTSEARCH= list, the catalog
is searched in the order in which it appears in the list. If a catalog in the
list does not exist, that particular item is ignored and searching continues.
 
To keep from having to list these options each time you invoke SAS, they can be
placed in the AUTOEXEC.SAS file.  This file contains SAS statements that are
executed automatically when you invoke the SAS System or when you start another
SAS process. The autoexec file can contain any valid SAS statements.  For more
information on this file, refer to your system documentation.
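
As a concrete sketch (the libref MYFMTS, the path placeholder, and the format name YESNO are only illustrative), the first step stores a format permanently and the second step makes it available in a later session:

  /* Create and permanently store the format */
  libname myfmts 'path-to-the-format-library';
  proc format lib=myfmts;          /* catalog defaults to MYFMTS.FORMATS */
    value yesno 1='Yes' 0='No';
  run;

  /* In a later session, point SAS at the catalog */
  libname myfmts 'path-to-the-format-library';
  options fmtsearch=(myfmts);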
 
 

Q: How do I convert a SAS data set that contains all numeric data in character variables to a SAS data set that has numeric variables containing the numeric data with the same variable names?

A: This macro changes all the character variables in a SAS data set to numeric variables with the same names.

The macro can also be modified to do the opposite. To change all numeric variables to character, make the changes described in the two comments in the code below; a sketch of that change follows the verification step at the end.

 
/** Sample dataset **/
data one;
input @1 a @3 b $10. @14 c $5. @20 d;
cards;
1 2654321234 33456 4
;
 
%macro vars(dsn);
 
  /* Collect the name and type (C or N) of every variable in &dsn */
  %let list=;
  %let type=;
  %let dsid=%sysfunc(open(&dsn));
  %let cnt=%sysfunc(attrn(&dsid,nvars));
   %do i = 1 %to &cnt;
    %let list=&list %sysfunc(varname(&dsid,&i));
    %let type=&type %sysfunc(vartype(&dsid,&i));
   %end;
  %let rc=%sysfunc(close(&dsid));
 
  /* Rename each variable to _name, re-create the original name with
     the converted value, and drop the renamed copies */
  data two(drop=
    %do i = 1 %to &cnt;
     %let temp=%scan(&list,&i);
       _&temp
    %end;);
   set &dsn(rename=(
    %do i = 1 %to &cnt;
     %let temp=%scan(&list,&i);
       &temp=_&temp
    %end;));
    %do j = 1 %to &cnt;
     %let temp=%scan(&list,&j);
   /** Change C to N for numeric to character conversion  **/
     %if %scan(&type,&j) = C %then %do;
   /** Also change INPUT to PUT for numeric to character  **/
      &temp=input(_&temp,32.);  /* 32. rather than 8. so values longer
                                   than 8 characters are not truncated */
     %end;
     %else %do;
      &temp=_&temp;
     %end;
    %end;
  run;
 
%mend vars;
 
%vars(one)
 
/** Verify conversion has been made **/
proc contents data=two;
run;
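
For the opposite (numeric-to-character) conversion mentioned above, the conversion block inside the macro would change roughly as follows; this is only a sketch, and the BEST12. format width is an assumption you may want to adjust:

    /** Numeric-to-character version of the conversion block **/
     %if %scan(&type,&j) = N %then %do;
      &temp=left(put(_&temp,best12.));
     %end;
     %else %do;
      &temp=_&temp;
     %end;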