Getting Started

Before you can use the SWAT package, you will need a running CAS server. The SWAT package can connect to either the binary port or the HTTP port. If both are available to you, the binary protocol will give you better performance.

Other than the CAS host and port, you just need a user name and password to connect. User names and passwords can be implemented in various ways, so you may need to ask your system administrator how to acquire an account.

To connect to a CAS server, you simply import SWAT and use the swat.CAS class to create a connection.

In [1]: import swat

In [2]: conn = swat.CAS(host, port, userid, password)

Now that we have a connection to CAS, we can run some actions on it.

Running CAS Actions

To test your connection, you can run the serverstatus action.

In [3]: out = conn.serverstatus()
NOTE: Grid node action status report: 1 nodes, 6 total actions executed.

In [4]: out
Out[4]: 
[About]

 {'CAS': 'Cloud Analytic Services',
  'Copyright': 'Copyright © 2014-2018 SAS Institute Inc. All Rights Reserved.',
  'Documentation': 'http://mycompany.com:8080/job/Actions_ref_doc/ws/casaref/index.html',
  'ServerTime': '2018-07-25T18:38:08Z',
  'System': {'Hostname': 'cas01',
   'Linux Distribution': 'Red Hat Enterprise Linux Server release 6.6 (Santiago)',
   'Model Number': 'x86_64',
   'OS Family': 'LIN X64',
   'OS Name': 'Linux',
   'OS Release': '2.6.32-504.12.2.el6.x86_64',
   'OS Version': '#1 SMP Sun Feb 1 12:14:02 EST 2015'},
  'Version': '3.04',
  'VersionLong': 'V.03.04M0D07242018',
  'license': {'expires': '20Sep2018:00:00:00',
   'gracePeriod': 62,
   'site': 'SAS Institute Inc.',
   'siteNum': 1,
   'warningPeriod': 31}}

[server]

 Server Status
 
    nodes  actions
 0      1        6

[nodestatus]

 Node Status
 
     name        role  uptime  running  stalled
 0  cas01  controller   0.247        0        0

+ Elapsed: 0.789s, user: 0.79s, mem: 1.11mb

Handling the Output

All CAS actions return a CASResults object. This is simply an ordered Python dictionary with a few extra methods and attributes added. In the output above, you’ll see the keys of the dictionary surrounded in square brackets. They are ‘About’, ‘server’, and ‘nodestatus’. Since this is a dictionary, you can just use the standard way of accessing keys.

In [5]: out['nodestatus']
Out[5]: 
Node Status

    name        role  uptime  running  stalled
0  cas01  controller   0.247        0        0

In addition, you can access the keys as attributes. This convenience was added to keep your code looking a bit cleaner. However, be aware that if the name of a key collides with a standard Python attribute or method, you’ll get that attribute or method instead. So this form is fine for interactive programming, but you may want to use the syntax above for actual programs.

In [6]: out.nodestatus
Out[6]: 
Node Status

    name        role  uptime  running  stalled
0  cas01  controller   0.247        0        0
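The name collision described above can be reproduced without a server. The following is a minimal sketch that uses a hypothetical stand-in class, AttrDict, rather than SWAT's own CASResults; it assumes only that attribute access falls back to key lookup when no real attribute exists.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a results object; not SWAT's CASResults."""

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails,
        # so real dict methods such as .keys or .items always win.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


results = AttrDict(nodestatus="node table", items="item table")

# No dict attribute is called "nodestatus", so the key is returned.
via_attribute = results.nodestatus

# "items" is a dict method, so attribute access returns the method, not the key.
collides = callable(results.items)

# Key access is unambiguous in both cases.
via_key = results["items"]
```

This is why the bracket syntax is the safer choice in programs: it never depends on which names the dictionary class already defines.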

The types of the result keys can vary as well. In this case, the ‘About’ key holds a dictionary. The ‘server’ and ‘nodestatus’ keys hold SASDataFrame objects (a subclass of pandas.DataFrame).

In [7]: for key, value in out.items():
   ...:     print(key, type(value))
   ...: 
About <class 'dict'>
server <class 'swat.SASDataFrame'>
nodestatus <class 'swat.SASDataFrame'>

Since the values in the result are standard Python (and pandas) objects, you can work with them as you normally do.

In [8]: out.nodestatus.role
Out[8]: 
0    controller
Name: role, dtype: object

In [9]: out.About['Version']
Out[9]: '3.04'

Simple Statistics

We can’t have a getting started section without doing some sort of statistical analysis. First, we need to see what CAS action sets are loaded. We can get a listing of all of the action sets and actions using the help CAS action. If you run help without any arguments, it will display all of the loaded actions and their descriptions. Rather than printing that large listing, we’ll specifically ask for the simple action set since we already know that’s the one we want.

In [10]: conn.help(actionset='simple');

Let’s start with the summary action. Of course, we first need to load some data. The simplest way to load data is to do it from the client side. Note that while this is the simplest way, it’s probably not the best way for large data sets. Those should be loaded from the server side if possible.

The CAS.read_csv() method works just like the pandas.read_csv() function. In fact, CAS.read_csv() uses pandas.read_csv() in the background. When pandas.read_csv() finishes parsing the CSV file into a pandas.DataFrame, it gets uploaded to a CAS table by CAS.read_csv(). The returned object is a CASTable object.

In [11]: tbl = conn.read_csv('https://raw.githubusercontent.com/'
   ....:                     'sassoftware/sas-viya-programming/master/data/cars.csv')
   ....: 
NOTE: Cloud Analytic Services made the uploaded file available as table TMP500JRX72 in caslib CASUSER(kesmit).
NOTE: The table TMP500JRX72 has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

CASTable objects are essentially client-side views of the table of data in the CAS server. You can interact with them using CAS actions as well as many of the pandas.DataFrame methods and attributes. The pandas.DataFrame API is mirrored as much as possible; the only difference is that, behind the scenes, the real work is being done by CAS.

If you don’t want the difficult-to-read generated name for a table, you can specify one using the casout= parameter.

In [12]: tbl = conn.read_csv('https://raw.githubusercontent.com/'
   ....:                     'sassoftware/sas-viya-programming/master/data/cars.csv',
   ....:                     casout='cars')
   ....: 
NOTE: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSER(kesmit).
NOTE: The table CARS has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

Since we started down this path with the intent to use the summary action, let’s do that first.

In [13]: out = conn.summary(table=tbl)

In [14]: out
Out[14]: 
[Summary]

 Descriptive Statistics for CARS
 
         Column      Min       Max      N    ...          TValue          ProbT  Skewness   Kurtosis
 0         MSRP  10280.0  192465.0  428.0    ...       34.894059  4.160412e-127  2.798099  13.879206
 1      Invoice   9875.0  173560.0  428.0    ...       35.196963  2.684398e-128  2.834740  13.946164
 2   EngineSize      1.3       8.3  428.0    ...       59.656105  3.133745e-209  0.708152   0.541944
 3    Cylinders      3.0      12.0  426.0    ...       76.913766  1.515569e-251  0.592785   0.440378
 4   Horsepower     73.0     500.0  428.0    ...       62.173176  4.185344e-216  0.930331   1.552159
 5     MPG_City     10.0      60.0  428.0    ...       79.229235  1.866284e-257  2.782072  15.791147
 6  MPG_Highway     12.0      66.0  428.0    ...       96.729204  1.665621e-292  1.252395   6.045611
 7       Weight   1850.0    7190.0  428.0    ...       97.526890  5.812547e-294  0.891824   1.688789
 8    Wheelbase     89.0     144.0  428.0    ...      269.196577   0.000000e+00  0.962287   2.133649
 9       Length    143.0     238.0  428.0    ...      268.525733   0.000000e+00  0.181977   0.614725
 
 [10 rows x 17 columns]

+ Elapsed: 0.0171s, user: 0.013s, sys: 0.008s, mem: 5.38mb

You can also call the summary action directly on the CASTable object, which automatically populates the table= parameter.

In [15]: out = tbl.summary()

In [16]: out
Out[16]: 
[Summary]

 Descriptive Statistics for CARS
 
         Column      Min       Max      N    ...          TValue          ProbT  Skewness   Kurtosis
 0         MSRP  10280.0  192465.0  428.0    ...       34.894059  4.160412e-127  2.798099  13.879206
 1      Invoice   9875.0  173560.0  428.0    ...       35.196963  2.684398e-128  2.834740  13.946164
 2   EngineSize      1.3       8.3  428.0    ...       59.656105  3.133745e-209  0.708152   0.541944
 3    Cylinders      3.0      12.0  426.0    ...       76.913766  1.515569e-251  0.592785   0.440378
 4   Horsepower     73.0     500.0  428.0    ...       62.173176  4.185344e-216  0.930331   1.552159
 5     MPG_City     10.0      60.0  428.0    ...       79.229235  1.866284e-257  2.782072  15.791147
 6  MPG_Highway     12.0      66.0  428.0    ...       96.729204  1.665621e-292  1.252395   6.045611
 7       Weight   1850.0    7190.0  428.0    ...       97.526890  5.812547e-294  0.891824   1.688789
 8    Wheelbase     89.0     144.0  428.0    ...      269.196577   0.000000e+00  0.962287   2.133649
 9       Length    143.0     238.0  428.0    ...      268.525733   0.000000e+00  0.181977   0.614725
 
 [10 rows x 17 columns]

+ Elapsed: 0.0154s, user: 0.015s, sys: 0.004s, mem: 5.36mb

Again, the output is a CASResults object (a subclass of a Python dictionary), so we can pull off the keys we want (there is only one in this case). This key contains a SASDataFrame, but since it’s a subclass of pandas.DataFrame, you can do all of the standard DataFrame operations on it.

In [17]: summ = out.Summary

In [18]: summ = summ.set_index('Column')

In [19]: summ.loc['Cylinders', 'Max']
Out[19]: 12.0

Loading CAS Action Sets

While CAS comes with a few pre-loaded action sets, you will likely want to load action sets with other capabilities such as percentiles, Data step, SQL, or even machine learning. Most action sets will require a license to run them, so you’ll have to take care of those issues before you can load them.

The action used to load action sets is called loadactionset.

In [20]: conn.loadactionset('percentile')
NOTE: Added action set 'percentile'.
Out[20]: 
[actionset]

 'percentile'

+ Elapsed: 0.000494s, user: 0.001s, mem: 0.247mb

Once you load an action set, its actions will be automatically added as methods to the CAS connection and any CASTable objects associated with that connection.

In [21]: tbl.percentile()
Out[21]: 
[Percentile]

 Percentiles for CARS
 
        Variable  Pctl     Value  Converged
 0          MSRP  25.0  20329.50        1.0
 1          MSRP  50.0  27635.00        1.0
 2          MSRP  75.0  39215.00        1.0
 3       Invoice  25.0  18851.00        1.0
 4       Invoice  50.0  25294.50        1.0
 5       Invoice  75.0  35732.50        1.0
 6    EngineSize  25.0      2.35        1.0
 7    EngineSize  50.0      3.00        1.0
 8    EngineSize  75.0      3.90        1.0
 9     Cylinders  25.0      4.00        1.0
 10    Cylinders  50.0      6.00        1.0
 11    Cylinders  75.0      6.00        1.0
 12   Horsepower  25.0    165.00        1.0
 13   Horsepower  50.0    210.00        1.0
 14   Horsepower  75.0    255.00        1.0
 15     MPG_City  25.0     17.00        1.0
 16     MPG_City  50.0     19.00        1.0
 17     MPG_City  75.0     21.50        1.0
 18  MPG_Highway  25.0     24.00        1.0
 19  MPG_Highway  50.0     26.00        1.0
 20  MPG_Highway  75.0     29.00        1.0
 21       Weight  25.0   3103.00        1.0
 22       Weight  50.0   3474.50        1.0
 23       Weight  75.0   3978.50        1.0
 24    Wheelbase  25.0    103.00        1.0
 25    Wheelbase  50.0    107.00        1.0
 26    Wheelbase  75.0    112.00        1.0
 27       Length  25.0    178.00        1.0
 28       Length  50.0    187.00        1.0
 29       Length  75.0    194.00        1.0

+ Elapsed: 0.0487s, user: 0.075s, sys: 0.027s, mem: 14.2mb

Note that the percentile action set has an action called percentile in it. You can call the action either as tbl.percentile or tbl.percentile.percentile.

CAS Tables as DataFrames

As we mentioned previously, CASTable objects implement many of the pandas.DataFrame methods and properties. This means that you can use the familiar pandas.DataFrame API, but use it on data that is far too large for pandas to handle. Here are a few simple examples.

In [22]: tbl.head()
Out[22]: 
Selected Rows from Table CARS

    Make           Model   Type Origin   ...   MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia   ...          23.0  4451.0      106.0   189.0
1  Acura  RSX Type S 2dr  Sedan   Asia   ...          31.0  2778.0      101.0   172.0
2  Acura         TSX 4dr  Sedan   Asia   ...          29.0  3230.0      105.0   183.0
3  Acura          TL 4dr  Sedan   Asia   ...          28.0  3575.0      108.0   186.0
4  Acura      3.5 RL 4dr  Sedan   Asia   ...          24.0  3880.0      115.0   197.0

[5 rows x 15 columns]

In [23]: tbl.describe()
Out[23]: 
                MSRP        Invoice  EngineSize     ...           Weight   Wheelbase      Length
count     428.000000     428.000000  428.000000     ...       428.000000  428.000000  428.000000
mean    32774.855140   30014.700935    3.196729     ...      3577.953271  108.154206  186.362150
std     19431.716674   17642.117750    1.108595     ...       758.983215    8.311813   14.357991
min     10280.000000    9875.000000    1.300000     ...      1850.000000   89.000000  143.000000
25%     20329.500000   18851.000000    2.350000     ...      3103.000000  103.000000  178.000000
50%     27635.000000   25294.500000    3.000000     ...      3474.500000  107.000000  187.000000
75%     39215.000000   35732.500000    3.900000     ...      3978.500000  112.000000  194.000000
max    192465.000000  173560.000000    8.300000     ...      7190.000000  144.000000  238.000000

[8 rows x 10 columns]

In [24]: tbl[['MSRP', 'Invoice']].describe(percentiles=[0.3, 0.7])
Out[24]: 
                MSRP        Invoice
count     428.000000     428.000000
mean    32774.855140   30014.700935
std     19431.716674   17642.117750
min     10280.000000    9875.000000
30%     22000.000000   20284.000000
50%     27635.000000   25294.500000
70%     35940.000000   32997.000000
max    192465.000000  173560.000000

For more information about CASTable, see the API Reference.

Closing the Connection

When you are finished with the connection, it’s always a good idea to close it.

In [25]: conn.close()
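One way to make sure close() is called even when an error interrupts your program is the standard library's contextlib.closing, which works with any object that has a close() method. The sketch below uses a hypothetical FakeConnection class so that it runs without a CAS server; with a real connection you would pass the object returned by swat.CAS(...) to closing() instead.

```python
from contextlib import closing


class FakeConnection:
    """Stand-in with a close() method so the sketch runs without a CAS server."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = FakeConnection()
try:
    with closing(conn) as c:
        # Work with the connection here; simulate a failure mid-session.
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# close() was called on the way out, despite the exception.
was_closed = conn.closed
```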

Authentication

While it is possible to put your username and password in the CAS constructor, it’s generally not a good idea to have a password in your code. To get around this issue, the CAS class supports authinfo files. An authinfo file stores username and password information for a specified hostname and port. It is protected by file permissions so that only you can read it. This allows you to set and protect your passwords in one place and have them used by all of your programs.

The format of the file is as follows:

host HOST user USERNAME password PASSWORD port PORT

machine is a synonym for host, login and account are synonyms for user, and protocol is a synonym for port.

You can specify as many host lines as you need. The port field is optional. If it is left off, all ports on that host will use the same password. Hostnames must match the hostname used in the CAS constructor exactly; no DNS expansion of the names is done. So ‘host1’ and ‘host1.my-company.com’ are considered two different hosts.

Here is an example for a user named ‘user01’ and password ‘!s3cret’ on host ‘cas.my-company.com’ and port 12354:

host cas.my-company.com port 12354 user user01 password !s3cret
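To make the matching rules concrete, here is a minimal sketch of how lines in this format could be parsed and looked up. It is not SWAT's own parser: the function names parse_authinfo and find_entry are hypothetical, the sample entries are made up for illustration, and the sketch assumes fields are separated by whitespace and that values contain no spaces.

```python
# Synonyms described above, mapped to the canonical field names.
SYNONYMS = {"machine": "host", "login": "user", "account": "user", "protocol": "port"}


def parse_authinfo(text):
    """Parse authinfo-style lines into a list of dicts (illustrative only)."""
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or len(tokens) % 2:
            continue  # skip blank or malformed lines
        entry = {}
        for key, value in zip(tokens[0::2], tokens[1::2]):
            entry[SYNONYMS.get(key, key)] = value
        entries.append(entry)
    return entries


def find_entry(entries, host, port=None):
    """Return the first entry whose host matches exactly; port is optional."""
    for entry in entries:
        if entry.get("host") != host:
            continue  # no DNS expansion: names must be identical
        if "port" in entry and port is not None and int(entry["port"]) != port:
            continue  # entry is tied to a different port
        return entry
    return None


sample = """\
host cas.my-company.com port 12354 user user01 password !s3cret
machine host1 login user02 password pw2
"""
entries = parse_authinfo(sample)
```

In this sketch, find_entry(entries, 'host1', 5570) finds the second entry because that line has no port field, while find_entry(entries, 'host1.my-company.com') finds nothing because the hostname is not an exact match.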

By default, the authinfo file is looked for in your home directory under the name .authinfo. You can also use the name .netrc, which is the name of an older specification that authinfo was based on.

The file must be readable and writable by the owner only. You can set these permissions with the following command:

chmod 0600 ~/.authinfo
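If you create the file from a script, the same permissions can be set from Python with the standard library. The sketch below assumes a POSIX system and works on a throwaway temporary file; the helper name restrict_to_owner is hypothetical.

```python
import os
import stat
import tempfile


def restrict_to_owner(path):
    """Set permissions to 0600 (owner read/write only) and return the resulting mode."""
    os.chmod(path, 0o600)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a throwaway file rather than a real ~/.authinfo.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
os.chmod(tmp_path, 0o644)  # start from a looser mode
mode = restrict_to_owner(tmp_path)
os.remove(tmp_path)
```

For a real file you would call restrict_to_owner(os.path.expanduser('~/.authinfo')).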

If you don’t want to use an authinfo file in your home directory, you can specify the name of a file explicitly using the authinfo= parameter.

In [26]: conn = swat.CAS('cas.my-company.com', 12354, authinfo='/path/to/authinfo.txt')