Training Manual
Data Analysis using SAS
Sujai Das
NIRJAFT, 12 Regent Park, Kolkata - 700040
1. Introduction
SAS (Statistical Analysis System) is a comprehensive software package that handles many tasks related to statistical analysis, spreadsheets, data creation, graphics, etc. It has a layered, multivendor architecture. Regardless of differences in hardware, operating systems, etc., SAS applications look the same and produce the same results. The three components of the SAS System are Host, Portable Applications and Data. The Host component provides all the required interfaces between the SAS System and the operating environment. Functionalities and applications reside in the Portable component, and the user supplies the Data. In this course we deal with the part of the software used for statistical analysis of data.
Windows of SAS
1. Program Editor : All the instructions are given here.
2. Log : Displays SAS statements submitted for execution and messages
3. Output : Gives the output generated
Rules for SAS Statements
1. A SAS program communicates with the computer through SAS statements.
2. Each statement of a SAS program must end with a semicolon (;).
3. Each program must end with a RUN statement.
4. Statements can be started from any column.
5. One can use upper case letters, lower case letters or the combination of the two.
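The rules above can be seen together in a minimal sketch. The data set name demo and the variables x, y and total are hypothetical and chosen only for illustration.

```sas
* Every statement ends with a semicolon and may start in any column;
data demo;
   input x y;
   total = x + y;
   cards;
1 2
3 4
;
run;

      PROC PRINT DATA=demo;
RUN;
```

The DATA step is written in lower case and the PROC step in upper case with extra indentation to show that case and starting column do not matter.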
Basic Sections of SAS Program
1. DATA section
2. CARDS section
3. PROCEDURE section
Data Section
We shall discuss some facts regarding data before we give the syntax for this section.
Data value: A single unit of information, such as the name of the species to which a tree belongs, the height of one tree, etc.
Variable: A set of values that describe a specific data characteristic, e.g. the diameters of all trees in a group. A variable name can have up to 32 characters (8 in SAS version 6 and earlier) and must begin with a letter or an underscore. Variables are of two types:
Character Variable: Its values are combinations of letters, digits and special characters or symbols.
Numeric Variable: Its values are numbers, with or without decimal points and with a + or - sign.
Observation: A set of data values for the same item, i.e. all measurements on one tree.
The data section starts with a DATA statement:
DATA NAME; (the name has to be supplied by the user)
Input Statements
Input statements are part of data section. This statement provides the SAS system the name of the variables with the format, if it is formatted.
List Directed Input
Data are read in the order of variables given in input statement.
Data values are separated by one or more spaces.
Missing values are represented by period (.).
Character variables are marked by a $ (dollar sign) after the variable name in the INPUT statement.
Example
DATA A;
INPUT ID SEX $ AGE HEIGHT WEIGHT;
CARDS;
1 M 23 68 155
2 F . 61 102
3 M 55 70 202
;
Column Input
The columns occupied by each variable can be indicated in the input statement, for example:
INPUT ID 1-3 SEX $ 4 HEIGHT 5-6 WEIGHT 7-11;
CARDS;
001M68155.5
  2F61   99
  3M53 33.5
;
Alternatively, the starting column of each variable can be indicated along with its length:
INPUT @1 ID 3.
@4 SEX $1.
@9 AGE 2.
@11 HEIGHT 2.
@16 V_DATE MMDDYY6.
;
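As a sketch of how the pointer-and-informat form above can be used in a complete step, the following program reads two made-up records laid out in exactly those columns. The data set name visits and the data values are hypothetical; the FORMAT statement is added only so that the date prints in readable form.

```sas
data visits;
   input @1  id     3.
         @4  sex    $1.
         @9  age    2.
         @11 height 2.
         @16 v_date mmddyy6.;
   /* display the stored SAS date value as a calendar date */
   format v_date date9.;
   cards;
001M    2368   102199
002F    3161   051500
;
run;

proc print data=visits;
run;
```

Each value must sit in the columns named in the INPUT statement, so the blanks in the data lines are significant.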
Reading More than One Line Per Observation for One Record of Input Variables
INPUT #1 ID 1-3 AGE 5-6 HEIGHT 10-11
#2 SBP 5-7 DBP 8-10;
CARDS;
001 56   72
    140 80
;
Reading the Variable More than Once
Suppose the id variable is read from six columns and the state code is given in the last two columns of the id variable. Both can be read as:
INPUT @1 ID 6. @5 STATE 2.;
or
INPUT ID 1-6 STATE 5-6;
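A minimal sketch of this technique in a complete step is given below. The data set name states and the two id values are made up for illustration.

```sas
data states;
   /* read columns 1-6 as ID, then move back to column 5 and read 5-6 again as STATE */
   input @1 id 6. @5 state 2.;
   cards;
100107
100212
;
run;

proc print data=states;
run;
```

The column pointer @5 moves the read position backwards, which is what allows the same columns to feed two variables.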
Formatted Lists
DATA B;
INPUT ID @1(X1-X2)(1.)
@4(Y1-Y2)(3.); CARDS;
11 563789
22 567987
;
PROC PRINT; RUN;
Output
Obs  ID  X1  X2   Y1   Y2
  1  11   1   1  563  789
  2  22   2   2  567  987
DATA C;
INPUT X Y Z @; CARDS;
1 1 1 2 2 2 5 5 5 6 6 6
1 2 3 4 5 6 3 3 3 4 4 4
;
PROC PRINT; RUN;
Output
Obs. X Y Z
1 1 1 1
2 1 2 3
With a single trailing @, the input line is held only until the current iteration of the DATA step ends, so only the first three values on each line are read.
DATA D;
INPUT X Y Z @@;
CARDS;
1 1 1 2 2 2 5 5 5 6 6 6
1 2 3 4 5 6 3 3 3 4 4 4
;
PROC PRINT; RUN;
Output:
Obs  X  Y  Z
  1  1  1  1
  2  2  2  2
  3  5  5  5
  4  6  6  6
  5  1  2  3
  6  4  5  6
  7  3  3  3
  8  4  4  4
With the double trailing @@, the input line is held across iterations of the DATA step, so every group of three values on a line becomes one observation.
SAS System Can Read and Write
DATA FILES
A. Simple ASCII files are read with input and infile statements
B. Output Data files
Creation of SAS Data Set
DATA EX1;
INPUT GROUP $ X Y Z; CARDS;
T1 12 17 19
T2 23 56 45
T3 19 28 12
T4 22 23 36
T5 34 23 56
;
Creation of SAS File From An External (ASCII) File
DATA EX2;
INFILE 'B:MYDATA';
INPUT GROUP $ X Y Z;
OR
DATA EX2A;
FILENAME ABC 'B:MYDATA';
INFILE ABC;
INPUT GROUP $ X Y Z;
RUN;
Creation of A SAS Data Set and An Output ASCII File Using an External File
DATA EX3;
FILENAME IN 'C:MYDATA';
FILENAME OUT 'A:NEWDATA';
INFILE IN;
FILE OUT;
INPUT GROUP $ X Y Z;
TOTAL = SUM(X, Y, Z); /* sum of X, Y and Z, ignoring missing values */
PUT GROUP $ 1-10 @12 (X Y Z TOTAL)(5.);
RUN;
This program reads the raw data file 'C:MYDATA', creates a new variable TOTAL and writes the output to the file 'A:NEWDATA'.
Creation of SAS File from an External (*.csv) File
data EX4;
infile 'C:UsersAdmnDesktopsscnars.csv' dlm=',';
/* give the exact path of the file; the file should not have column headings */
/* if the first row holds the names of the columns, write firstobs=2 in the infile statement so that data are read from row 2 onwards */
input sn loc $ year season $ crop $ rep trt gyield syield return kcal; /* give the variables in the order in which they appear in the file */
biomass = gyield + syield; /* generates a new variable */
run;
proc print data=EX4;
run;
Note: To create a SAS File from a *.txt file, only change csv to txt and define delimiter as per file created.
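The firstobs=2 option mentioned in the comments above can be sketched as follows for a file whose first row holds column names. The file path and the shortened variable list are hypothetical; dsd and missover are standard INFILE options added here as an assumption about how such a file is usually read.

```sas
data ex4h;
   /* hypothetical path; replace it with the real location of the file */
   infile 'C:\mydata\yield.csv' dlm=',' dsd firstobs=2 missover;
   input sn loc $ year gyield syield;
   biomass = gyield + syield;
run;

proc print data=ex4h;
run;
```

For a tab-delimited *.txt file the same sketch should work with dlm='09'x in place of dlm=','.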
Creation of SAS File from an External (*.xls) File
Note: It is good practice to list the names of the variables in a comment line before PROC IMPORT.
/* names of the variables in the Excel file (the first row must contain the variable names) */
proc import datafile = 'C:UsersDesktopDATA_EXERCISEdescriptive_stats.xls'
/*give the exact path of the file*/
out = descriptive_stats replace; /*give output file name*/
proc print;
run;
If we want to make some transformations, then we may use the following statements:
data a1;
set descriptive_stats;
x = fs45+fw;
run;
Here PROC IMPORT allows the SAS user to import data from an Excel spreadsheet into SAS. The DATAFILE= option gives the location of the file. The OUT= option names the SAS data set created by the import procedure. The PRINT procedure is used to view the contents of the SAS data set descriptive_stats. When we run the above code, the output is the same as shown above because we are using the same data.
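A slightly more explicit form of the import step is sketched below. It assumes that SAS/ACCESS Interface to PC Files is licensed, that the worksheet is called Sheet1 and that the file sits at the hypothetical path shown; none of these details come from the original example.

```sas
proc import datafile='C:\mydata\descriptive_stats.xls'  /* hypothetical path */
     out=descriptive_stats
     dbms=xls
     replace;
   sheet='Sheet1';   /* assumed worksheet name */
   getnames=yes;     /* take variable names from the first row */
run;

proc print data=descriptive_stats;
run;
```

Naming the DBMS and the sheet explicitly avoids relying on the file extension and on the default worksheet.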
Creating a Permanent SAS Data Set
LIBNAME XYZ 'C:SASDATA';
DATA XYZ.EXAMPLE;
INPUT GROUP $ X Y Z;
CARDS;
.....
.....
.....
;
RUN;
This program reads data following the cards statement and creates a permanent SAS data set in a subdirectory named SASDATA on the C: drive.
Using Permanent SAS File
LIBNAME XYZ 'C:SASDATA';
PROC MEANS DATA=XYZ.EXAMPLE; RUN;
TITLES
One can enter up to 10 titles at the top of the output using TITLE statements in a procedure.
PROC PRINT;
TITLE 'HEIGHT-DIA STUDY';
TITLE3 '1999 STATISTICS';
RUN;
Comment cards can be added to the SAS program using
/* COMMENTS */;
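As a small sketch, SAS accepts two comment styles: the block form shown above and a statement form that begins with an asterisk and ends at the next semicolon. The DATA _NULL_ step below is only a hypothetical carrier for the comments.

```sas
/* Block comment: may span several lines
   and may contain semicolons; like this one */
* Statement-style comment, ended by a semicolon;
data _null_;
   x = 1;   /* a comment may follow a statement on the same line */
   put x=;
run;
```

The block form is the safer choice when the comment text itself contains a semicolon.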
FOOTNOTES
One can enter up to 10 footnotes at the bottom of the output.
PROC PRINT DATA=DIAHT;
FOOTNOTE '1999';
FOOTNOTE5 'STUDY RESULTS';
RUN;
To obtain the output as an RTF file, place the procedure statements between the following two statements:
ods rtf file='xyz.rtf' style=journal;
ods rtf close;
To obtain the output as a PDF or HTML file, replace rtf with pdf or html in the above statements. If we want the output in continuous format, then we may use
ods rtf file='xyz.rtf' style=journal bodytitle startpage=no;
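A minimal sketch of a complete ODS block is given below. It uses sashelp.class, a sample data set shipped with SAS, purely as an assumed stand-in for the user's own data.

```sas
ods rtf file='xyz.rtf' style=journal;   /* open the RTF destination */

proc print data=sashelp.class;
   title 'Listing sent to an RTF file';
run;

ods rtf close;                          /* close it so the file is written */
```

Only the output of procedures that run between the opening statement and ods rtf close is written to xyz.rtf.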
LABELLING THE VARIABLES
Data dose;
title 'yield with factors N P K';
input N P K Yield;
Label N = "Nitrogen";
Label P = "Phosphorus";
Label K = "Potassium";
cards;
...
...
...
;
Proc print;
run;
We can define the line size of the output using the OPTIONS statement. For example, if the output should have a line size (number of columns in a line) of 72, use Options linesize=72; at the beginning of the program.
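The OPTIONS statement can carry several system options at once. The sketch below adds pagesize= and nodate as assumed extras and again uses the sample data set sashelp.class in place of the user's data.

```sas
/* global options stay in effect until they are changed again */
options linesize=72 pagesize=60 nodate;

proc print data=sashelp.class;
run;
```

These settings affect the traditional listing output; ODS destinations such as RTF or HTML handle page layout in their own way.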
2. Statistical Procedure
SAS/STAT has many capabilities using different procedures with many options. There are a total of 73 PROCS in SAS/STAT. SAS/STAT is capable of performing a wide range of statistical analysis that includes:
1. Elementary / Basic Statistics
2. Graphs/Plots
3. Regression and Correlation Analysis
4. Analysis of Variance
5. Experimental Data Analysis
6. Multivariate Analysis
7. Principal Component Analysis
8. Discriminant Analysis
9. Cluster Analysis
10. Survey Data Analysis
11. Mixed model analysis
12. Variance Components Estimation
13. Probit Analysis and many more…
A brief on SAS/STAT Procedures is available at http://support.sas.com/rnd/app/da/stat/procedures/Procedures.html
Example 2.1: To Calculate the Means and Standard Deviation:
DATA TESTMEAN;
INPUT GROUP $ X Y Z;
CARDS;
CONTROL 12 17 19
TREAT1 23 25 29
TREAT2 19 18 16
TREAT3 22 24 29
CONTROL 13 16 17
TREAT1 20 24 28
TREAT2 16 19 15
TREAT3 24 26 30
CONTROL 14 19 21
TREAT1 23 25 29
TREAT2 18 19 17
TREAT3 23 25 30
;
PROC MEANS; VAR X Y Z; RUN;
The default output displays the number of observations (N), mean, standard deviation, minimum value and maximum value of each variable. We can choose the required statistics from the options of PROC MEANS. For example, if we require mean, standard deviation, median, coefficient of variation, coefficient of skewness, coefficient of kurtosis, etc., then we can write
PROC MEANS mean std median cv skewness kurtosis;
VAR X Y Z;
RUN;
The number of decimal places in the output can be set with the option maxdec=. For example, for an output with three decimal places, we may write
PROC MEANS mean std median cv skewness kurtosis maxdec=3;
VAR X Y Z;
RUN;
To obtain the means group-wise, first sort the data by group using
PROC SORT;
BY GROUP;
RUN;
and then use the following:
PROC MEANS;
VAR X Y Z;
BY GROUP;
RUN;
Alternatively, we may use the CLASS statement, which does not require the data to be sorted:
PROC MEANS;
CLASS GROUP;
VAR X Y Z;
RUN;
To obtain descriptive statistics for given data, one can also use PROC SUMMARY. In the above example, if one wants the mean, standard deviation, coefficient of variation, coefficient of skewness and kurtosis, then one may use the following:
PROC SUMMARY PRINT MEAN STD CV SKEWNESS KURTOSIS;
CLASS GROUP;
VAR X Y Z;
RUN;
Many statistical procedures assume that the data are normally distributed. For testing the normality of data, PROC UNIVARIATE may be utilized.
PROC UNIVARIATE NORMAL;
VAR X Y Z;
RUN;
If different plots are required, then one may use:
PROC UNIVARIATE DATA=TESTMEAN NORMAL PLOT;
/* the PLOT option displays a stem-and-leaf plot, a box plot and a normal probability plot */
VAR X Y Z;
/* the BY statement creates side-by-side box plots group-wise; to use it, first sort the data on the BY variable */
BY GROUP;
HISTOGRAM / KERNEL NORMAL; /* displays the kernel density along with the normal curve */
PROBPLOT; /* plots the probability plot */
QQPLOT X / NORMAL SQUARE; /* plots the quantile-quantile (Q-Q) plot */
CDFPLOT X / NORMAL; /* plots the CDF plot */
/* the P-P plot compares the empirical cumulative distribution function (ecdf) of a variable with a specified theoretical cumulative distribution function; the beta, exponential, gamma, lognormal, normal and Weibull distributions are available */
PPPLOT X / NORMAL;
RUN;
Example 2.2: To Create Frequency Tables
DATA TESTFREQ;
INPUT AGE $ ECG CHD $ CAT $ WT; CARDS;
<55 0 YES YES 1
<55 0 YES YES 17
<55 0 NO YES 7
<55 1 YES NO 257
<55 1 YES YES 3
<55 1 YES NO 7
<55 1 NO YES 1
55+ 0 YES YES 9
55+ 0 YES NO 15
55+ 0 NO YES 30
55+ 1 NO NO 107
55+ 1 YES YES 14
55+ 1 YES NO 5
55+ 1 NO YES 44
55+ 1 NO NO 27
;
PROC FREQ DATA=TESTFREQ;
TABLES AGE*ECG / MISSING CHISQ;
TABLES AGE*CAT / LIST;
RUN;
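The variable WT in TESTFREQ looks like a count of subjects for each combination of the other variables, although the text does not say so. Under that assumption, a WEIGHT statement tells PROC FREQ to treat each row as WT observations, as in this sketch:

```sas
proc freq data=testfreq;
   weight wt;                    /* assumed: each row stands for WT subjects */
   tables age*ecg / chisq;
   tables age*cat / list;
run;
```

Without the WEIGHT statement every data line is counted once, whatever the value of WT.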
SCATTER PLOT
PROC PLOT DATA = DIAHT;
PLOT HT*DIA = '*';
/* HT = vertical axis, DIA = horizontal axis */
RUN;
CHART
PROC CHART DATA = DIAHT;
VBAR HT;
RUN;
PROC CHART DATA = DIAHT;
HBAR DIA;
RUN;
PROC CHART DATA = DIAHT;
PIE HT;
RUN;
Example 2.3: To Create A Permanent SAS DATASET and use that for Regression
LIBNAME FILEX 'C:SASRPLIB';
DATA FILEX.RP;
INPUT X1-X5;
CARDS;
1 0 0 0 5.2
.75 .25 0 0 7.2
.75 0 .25 0 5.8
.5 .25 .25 0 6.3
.75 0 0 .25 5.5
.5 0 .25 .25 5.7
.5 .25 0 .25 5.8
.25 .25 .25 .25 5.7
;
RUN;
LIBNAME FILEX 'C:SASRPLIB';
PROC REG DATA=FILEX.RP;
MODEL X5 = X1 X2 / P;
MODEL X5 = X1 X2 X3 X4 / SELECTION = STEPWISE;
TEST: TEST X1-X2=0;
RUN;
Various other commonly used PROC Statements are PROC ANOVA, PROC GLM; PROC CORR; PROC NESTED; PROC