Loading Data¶

There are various ways of loading data into CAS. They range from parsing client-side files from various formats into pandas.DataFrame objects to loading large data files that are stored on the server. Which method you choose depends on your needs and how large the data is. For small data sets or data that you want to use a custom parser for, you can use client-side data. For large data sets, you would use server-side data files.

Client-Side Data Files and Sources¶

Using Client-Side Parsers¶

The easiest way to get data into CAS is using the data loading methods on the CAS object that parallel data reading operations in the pandas module. This includes pandas.read_csv(), pandas.read_table(), pandas.read_excel(), etc. The same methods exist on CAS objects as well and, in fact, use the Pandas functions in the background to do the parsing. The only difference is that the result of the method is a CASTable object rather than a pandas.DataFrame. Let’s look at an example.

In this example, we are pointing to a URL that references CSV data. You could just as easily point to a local file. Just keep in mind that when using a URL, the data is downloaded from wherever it is to the client machine for parsing before it is uploaded to CAS.

In [1]: conn = swat.CAS(host, port, username, password)

In [2]: cars = conn.read_csv('https://raw.githubusercontent.com/'
   ...:                      'sassoftware/sas-viya-programming/master/data/cars.csv')
   ...: 
NOTE: Cloud Analytic Services made the uploaded file available as table TMPKMJX6BEE in caslib CASUSER(kesmit).
NOTE: The table TMPKMJX6BEE has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

In [3]: cars.head()
Out[3]: 
Selected Rows from Table TMPKMJX6BEE

    Make           Model   Type Origin   ...   MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia   ...          23.0  4451.0      106.0   189.0
1  Acura  RSX Type S 2dr  Sedan   Asia   ...          31.0  2778.0      101.0   172.0
2  Acura         TSX 4dr  Sedan   Asia   ...          29.0  3230.0      105.0   183.0
3  Acura          TL 4dr  Sedan   Asia   ...          28.0  3575.0      108.0   186.0
4  Acura      3.5 RL 4dr  Sedan   Asia   ...          24.0  3880.0      115.0   197.0

[5 rows x 15 columns]

If all went well, you should now have the cars data set in your CAS session.

As we mentioned earlier, these methods on the CAS object use the Pandas functions under-the-covers to do the parsing. This means that you can also use all of the Pandas parsing options.

In [4]: cars = conn.read_csv('https://raw.githubusercontent.com/'
   ...:                      'sassoftware/sas-viya-programming/master/data/cars.csv',
   ...:                      usecols=[0, 1, 2, 3], names=['A', 'B', 'C', 'D'], skiprows=1)
   ...: 
NOTE: Cloud Analytic Services made the uploaded file available as table TMPBTLBN0A_ in caslib CASUSER(kesmit).
NOTE: The table TMPBTLBN0A_ has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

In [5]: cars.head()
Out[5]: 
Selected Rows from Table TMPBTLBN0A_

       A               B      C     D
0  Acura             MDX    SUV  Asia
1  Acura  RSX Type S 2dr  Sedan  Asia
2  Acura         TSX 4dr  Sedan  Asia
3  Acura          TL 4dr  Sedan  Asia
4  Acura      3.5 RL 4dr  Sedan  Asia

Note

Loading data from the client-side will be slower than loading data files from the server. Client-side data loading is intended for smaller data sets.

Parsing Client-Side Data on the Server¶

If you don’t need the full power of Pandas’ parsers, you may be better off uploading the file to CAS and parse it there. This offers some advantages as well. The server parsers will likely be faster than a client-side parser (especially in MPP mode where parsing of some file types can happen in parallel). Doing server-side parsing is also more efficient since it doesn’t require the data to be converted to Python objects before creating the data buffer to be sent to the server.

The CAS.upload_file() method uploads a data file as-is to CAS and invokes the table.loadtable action in the background to parse it. Let’s use the same data file as the previous examples, but let the server do the parsing. Just as before, if a URL is specified, it must be downloaded to the client then uploaded CAS.

In [6]: cars = conn.upload_file('https://raw.githubusercontent.com/'
   ...:                         'sassoftware/sas-viya-programming/master/data/cars.csv')
   ...: 
NOTE: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSER(kesmit).
NOTE: The table CARS has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

In [7]: cars.head()
Out[7]: 
Selected Rows from Table CARS

    Make           Model   Type Origin   ...   MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia   ...          23.0  4451.0      106.0   189.0
1  Acura  RSX Type S 2dr  Sedan   Asia   ...          31.0  2778.0      101.0   172.0
2  Acura         TSX 4dr  Sedan   Asia   ...          29.0  3230.0      105.0   183.0
3  Acura          TL 4dr  Sedan   Asia   ...          28.0  3575.0      108.0   186.0
4  Acura      3.5 RL 4dr  Sedan   Asia   ...          24.0  3880.0      115.0   197.0

[5 rows x 15 columns]

Note

Loading data from the client-side will be slower than loading data files from the server. Client-side data loading is intended for smaller data sets.

Creating Custom Data Loaders¶

In addition to the pre-configured data loaders described above, it’s possible to write custom data loaders that can upload data to a CAS table from any type of file or stream of data that Python can handle. These are called “data message handlers” and are implemented in the CASDataMsgHandler class. The CASDataMsgHandler class can not be used directly, but is the base class for all data message handlers classes.

Note

Data message handlers are not supported in the REST interface.

To implement your own data message handler, you only need to implement two things in the subclass.

The variable definitions required in the vars= parameter of the table.addtable CAS action.
The getrow() method to return each row of data.

The variable definitions should be stored in the vars attribute of the subclass. The getrow() method returns a row of values (as defined by vars) for each requested row index. When there is not more data to return, None should be returned.

How you generate the vars attribute is really dependent on your situation. It can be inferred from a sample of the data or simply hard-coded. We’ll do a simple example using hard-coded variable definitions below.

In [8]: import swat.cas.datamsghandlers as dmh

In [9]: class MyDMH(dmh.CASDataMsgHandler):
   ...:     def __init__(self, data):
   ...:         self.data = data
   ...:         vars = [
   ...:             dict(name='Name', type='varchar'),
   ...:             dict(name='Age', type='int32'),
   ...:             dict(name='Height', type='double'),
   ...:             dict(name='Weight', type='double'),
   ...:         ]
   ...:         dmh.CASDataMsgHandler.__init__(self, vars)
   ...:     def getrow(self, index):
   ...:         try:
   ...:             return self.data[index]
   ...:         except IndexError:
   ...:             pass
   ...: 

In [10]: mydmh = MyDMH([
   ....:     ['Alfred', 13, 69, 112.5],
   ....:     ['Judy', 14, 64.3, 90],
   ....:     ['Robert', 12, 64.8, 128]
   ....: ])
   ....: 

The table.addtable CAS action call below uses a Python shortcut to pass a dictionary as keyword parameters. If you print **mydmh.arg.addtable, you’ll see the parameters that are getting passed to table.addtable. You don’t have to use this shortcut mode; you could construct the variable definitions and pass them in manually. However, you still need to pass the data message handler instance to the datamsghandler= argument.

In [11]: out = conn.addtable(table='Students', **mydmh.args.addtable)

In [12]: students = out.casTable

In [13]: students.columninfo()
Out[13]: 
[ColumnInfo]

    Column  ID     Type  RawLength  FormattedLength  NFL  NFD
 0    Name   1  varchar          6                6    0    0
 1     Age   2    int32          4               12    0    0
 2  Height   3   double          8               12    0    0
 3  Weight   4   double          8               12    0    0

+ Elapsed: 0.00188s, user: 0.000999s, mem: 0.763mb

In [14]: students.head()
Out[14]: 
Selected Rows from Table STUDENTS

     Name  Age  Height  Weight
0  Alfred   13    69.0   112.5
1    Judy   14    64.3    90.0
2  Robert   12    64.8   128.0

While this example uses an explicit list of data and passes that to the data message handler class to index into, it doesn’t have to be done this way. In fact, there is a DBAPI data message handler in the swat.cas.datamsghandlers module that takes a Python database connection that queries for the data that is returned by getrow().

Note

Loading data from the client-side will be slower than loading data files from the server. Client-side data loading is intended for smaller data sets.

Server-Side Data Files and Sources¶

Using Server-Side Parsers¶

If you have data files on the server, you can load them directly from the CASLib that they are in. Paths to files in a CASLib are always relative paths. This is the recommended method for large data files.

In [15]: cars = conn.load_path('data/cars.csv', caslib='casuser')

In [16]: cars.head()
Out[16]: 
Selected Rows from Table DATA.CARS

    Make           Model   Type Origin   ...   MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia   ...          23.0  4451.0      106.0   189.0
1  Acura  RSX Type S 2dr  Sedan   Asia   ...          31.0  2778.0      101.0   172.0
2  Acura         TSX 4dr  Sedan   Asia   ...          29.0  3230.0      105.0   183.0
3  Acura          TL 4dr  Sedan   Asia   ...          28.0  3575.0      108.0   186.0
4  Acura      3.5 RL 4dr  Sedan   Asia   ...          24.0  3880.0      115.0   197.0

[5 rows x 15 columns]

Loading Data from Other Sources¶

In addition to files, CAS has many other data loaders available to connect to sources such as databases. These other data sources require you to configure a CASLib that can connect and retrieve the data as a CAS table. For these other data sources, you would still use the CAS.load_path method, but rather than specifying a file path, you would specify the name of a resource in that data loader (such as a database table). This topic is beyond the scope of this document, but we are pointing it out in case you require this type of data access.