Indexing and Data Selection

Indexing of CASTable objects works much in the same way as they do in pandas.DataFrame objects. You can select one or more columns based on column names or indexes, and you can select slices of columns. However, data selection does have some limitations. CAS tables can be distributed across a grid of computers and they do not have a specified order. Because of this, indexing based on a row index is not possible at this time. However, it is possible to apply where clauses to a the table parameters to filter rows based on that.

There are a few properties that allow indexing a CASTable object in various ways. These properties work just like they pandas.DataFrame counterparts (with the limitations described above).

Property / Method Description
o[columns] Subset table based on column names
o.loc[:, columns] Subset table based on column names
o.iloc[:, columns] Subset table based on column indexes
o.ix[:, columns] Subset table based on mixed column names and indexes
o.xs(column, axis=1) Select a cross-section of the table
o[boolean-column] Filter data rows based on boolean column values
o.query(‘expr’) Apply a filter to the data values

The Basics

Just as with pandas.DataFrames, CASTable objects implement Python’s __getitem__ method to allow indexing using [ ]. This allows you to subset the columns that are visible in the table.

In [1]: tbl = conn.read_csv('https://raw.githubusercontent.com/'
   ...:                     'sassoftware/sas-viya-programming/master/data/cars.csv')
   ...: 
NOTE: Cloud Analytic Services made the uploaded file available as table TMPXUAOS9_X in caslib CASUSER(kesmit).
NOTE: The table TMPXUAOS9_X has been created in caslib CASUSER(kesmit) from binary data uploaded to Cloud Analytic Services.

In [2]: tbl.head()
Out[2]: 
Selected Rows from Table TMPXUAOS9_X

    Make           Model   Type Origin   ...   MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia   ...          23.0  4451.0      106.0   189.0
1  Acura  RSX Type S 2dr  Sedan   Asia   ...          31.0  2778.0      101.0   172.0
2  Acura         TSX 4dr  Sedan   Asia   ...          29.0  3230.0      105.0   183.0
3  Acura          TL 4dr  Sedan   Asia   ...          28.0  3575.0      108.0   186.0
4  Acura      3.5 RL 4dr  Sedan   Asia   ...          24.0  3880.0      115.0   197.0

[5 rows x 15 columns]

Here we are selecting a single column from the table. This will return a CASColumn object.

In [3]: tbl['Make'].head()
Out[3]: 
0    Acura
1    Acura
2    Acura
3    Acura
4    Acura
Name: Make, dtype: object

Selecting multiple columns returns a new CASTable object.

In [4]: tbl[['Make', 'Model', 'Horsepower']].head()
Out[4]: 
Selected Rows from Table TMPXUAOS9_X

    Make           Model  Horsepower
0  Acura             MDX       265.0
1  Acura  RSX Type S 2dr       200.0
2  Acura         TSX 4dr       200.0
3  Acura          TL 4dr       270.0
4  Acura      3.5 RL 4dr       225.0

You can also access individual columns using attribute syntax.

In [5]: tbl.Make.head()
Out[5]: 
0    Acura
1    Acura
2    Acura
3    Acura
4    Acura
Name: Make, dtype: object

Caution should be used when using attribute syntax because it depends on the fact that there are no existing attributes, methods, or CAS actions with that same name on the CASTable. It also requires that the column name contains a valid Python identifier. Since CAS actions can be added dynamically, attribute access should generally only be used in interactive programming. For programs that will be reused, it is safer to use the [ ] syntax.

Selecting by Name

The loc property is used to select columns based on the column names. Column names can be specified as a string, a list of strings, or a slice. If a string is given, a CASColumn is returned. If a list of strings or a slice is specified, a CASTable is returned.

A single string selects a column. Since row selection is not supported at this time, this is equivalent to tbl.loc['Make'].

In [6]: tbl.loc[:, 'Make'].head()
Out[6]: 
0    Acura
1    Acura
2    Acura
3    Acura
4    Acura
Name: Make, dtype: object

Using a list of strings selects those columns and returns a new CASTable object. Again, this is equivalent to tbl[['Make', 'Model']].

In [7]: tbl.loc[:, ['Make', 'Model']].head()
Out[7]: 
Selected Rows from Table TMPXUAOS9_X

    Make           Model
0  Acura             MDX
1  Acura  RSX Type S 2dr
2  Acura         TSX 4dr
3  Acura          TL 4dr
4  Acura      3.5 RL 4dr

Slicing using column names allows you to select a range of columns.

In [8]: tbl.loc[:, 'Model':'Invoice'].head()
Out[8]: 
Selected Rows from Table TMPXUAOS9_X

            Model   Type Origin DriveTrain     MSRP  Invoice
0             MDX    SUV   Asia        All  36945.0  33337.0
1  RSX Type S 2dr  Sedan   Asia      Front  23820.0  21761.0
2         TSX 4dr  Sedan   Asia      Front  26990.0  24647.0
3          TL 4dr  Sedan   Asia      Front  33195.0  30299.0
4      3.5 RL 4dr  Sedan   Asia      Front  43755.0  39014.0

You can even specify a step size.

In [9]: tbl.loc[:, 'Model':'Invoice':2].head()
Out[9]: 
Selected Rows from Table TMPXUAOS9_X

            Model Origin     MSRP
0             MDX   Asia  36945.0
1  RSX Type S 2dr   Asia  23820.0
2         TSX 4dr   Asia  26990.0
3          TL 4dr   Asia  33195.0
4      3.5 RL 4dr   Asia  43755.0

Note that when using columns names in slices, both endpoints are included in the slice. This is not the same behavior for numeric indexes, but is consistent with the way that slicing works in pandas.DataFrame objects.

Selecting by Position

The iloc property is used to select columns based on column indices. Just like with loc, the column indices can be specified as a single integer, a list of integers, or a slice.

In [10]: tbl.iloc[:, 1].head()
Out[10]: 
0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

Using a list of integers returns a new CASTable object.

In [11]: tbl.iloc[:, [1, 5, 3]].head()
Out[11]: 
Selected Rows from Table TMPXUAOS9_X

            Model     MSRP Origin
0             MDX  36945.0   Asia
1  RSX Type S 2dr  23820.0   Asia
2         TSX 4dr  26990.0   Asia
3          TL 4dr  33195.0   Asia
4      3.5 RL 4dr  43755.0   Asia

Of course, ranges work here as well, with or without a step size.

In [12]: tbl.iloc[:, 2:6].head()
Out[12]: 
Selected Rows from Table TMPXUAOS9_X

    Type Origin DriveTrain     MSRP
0    SUV   Asia        All  36945.0
1  Sedan   Asia      Front  23820.0
2  Sedan   Asia      Front  26990.0
3  Sedan   Asia      Front  33195.0
4  Sedan   Asia      Front  43755.0
In [13]: tbl.iloc[:, 6:2:-2].head()
Out[13]: 
Selected Rows from Table TMPXUAOS9_X

    Make           Model   Type Origin   ...   MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia   ...          23.0  4451.0      106.0   189.0
1  Acura  RSX Type S 2dr  Sedan   Asia   ...          31.0  2778.0      101.0   172.0
2  Acura         TSX 4dr  Sedan   Asia   ...          29.0  3230.0      105.0   183.0
3  Acura          TL 4dr  Sedan   Asia   ...          28.0  3575.0      108.0   186.0
4  Acura      3.5 RL 4dr  Sedan   Asia   ...          24.0  3880.0      115.0   197.0

[5 rows x 15 columns]

Mixing Names and Position

The ix property works just like the loc and iloc properties except that it takes a mix of column names and indexes.

In [14]: tbl.ix[:, 'Model'].head()
Out[14]: 
0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

In [15]: tbl.ix[:, 3].head()
Out[15]: 
0    Asia
1    Asia
2    Asia
3    Asia
4    Asia
Name: Origin, dtype: object
In [16]: tbl.ix[:, ['Model', 4, 3]].head()
Out[16]: 
Selected Rows from Table TMPXUAOS9_X

            Model DriveTrain Origin
0             MDX        All   Asia
1  RSX Type S 2dr      Front   Asia
2         TSX 4dr      Front   Asia
3          TL 4dr      Front   Asia
4      3.5 RL 4dr      Front   Asia
In [17]: tbl.ix[:, 'Model':6:2].head()
Out[17]: 
Selected Rows from Table TMPXUAOS9_X

            Model Origin     MSRP
0             MDX   Asia  36945.0
1  RSX Type S 2dr   Asia  23820.0
2         TSX 4dr   Asia  26990.0
3          TL 4dr   Asia  33195.0
4      3.5 RL 4dr   Asia  43755.0

Selecting a Cross Section

The xs method currently only supports column selection (i.e., axis=1). It is primarily here for future development.

In [18]: tbl.xs('Model', axis=1).head()
Out[18]: 
0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

Boolean Indexing

It is possible to use a CASColumn as a way to select rows in a CAS table. The CASColumn should contain values that are valid booleans to CAS (typically integer values where 0 is false and non-zero is true).

Here is a basic example that selects all cars with an MSRP value over 80,000.

In [19]: tbl[tbl.MSRP > 80000].head()
Out[19]: 
Selected Rows from Table TMPXUAOS9_X

     Make                         Model    Type   ...    Weight Wheelbase  Length
0   Acura        NSX coupe 2dr manual S  Sports   ...    3153.0     100.0   174.0
1    Audi                      RS 6 4dr  Sports   ...    4024.0     109.0   191.0
2   Dodge  Viper SRT-10 convertible 2dr  Sports   ...    3410.0      99.0   176.0
3  Jaguar                 XKR coupe 2dr  Sports   ...    3865.0     102.0   187.0
4  Jaguar           XKR convertible 2dr  Sports   ...    4042.0     102.0   187.0

[5 rows x 15 columns]

Conditions can be combined with | for or, & for and, and ~ for not. However, due to the order of precedence in Python, you must put your comparisons operations in parentheses before combining them with these operators.

In [20]: tbl[(tbl.MSRP > 80000) & (tbl.Horsepower > 400)].head()
Out[20]: 
Selected Rows from Table TMPXUAOS9_X

            Make                         Model    Type   ...    Weight Wheelbase  Length
0           Audi                      RS 6 4dr  Sports   ...    4024.0     109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports   ...    3410.0      99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan   ...    4473.0     114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports   ...    4235.0     101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports   ...    4429.0     101.0   179.0

[5 rows x 15 columns]

Since each mask of a CASTable object returns a new CASTable object, you can split operations across multiple steps.

In [21]: expensive = tbl[tbl.MSRP > 80000]

In [22]: expensive[expensive.Horsepower > 400].head()
Out[22]: 
Selected Rows from Table TMPXUAOS9_X

            Make                         Model    Type   ...    Weight Wheelbase  Length
0           Audi                      RS 6 4dr  Sports   ...    4024.0     109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports   ...    3410.0      99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan   ...    4473.0     114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports   ...    4235.0     101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports   ...    4429.0     101.0   179.0

[5 rows x 15 columns]

Warning

You can only use columns from within the same CAS table in boolean operations. If you want to combine operations across tables, you should create a view that contains all of the data, then use the filtering features outlined above on that view.

The query Method

Rather than using the boolean data selection described above, you can write a CAS where expression and apply it to a CASTable object directly using the CASTable.query() method. This can often result in more readable code when using longer expressions.

In [23]: tbl.query('MSRP > 80000 and Horsepower > 400').head()
Out[23]: 
Selected Rows from Table TMPXUAOS9_X

            Make                         Model    Type   ...    Weight Wheelbase  Length
0           Audi                      RS 6 4dr  Sports   ...    4024.0     109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports   ...    3410.0      99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan   ...    4473.0     114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports   ...    4235.0     101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports   ...    4429.0     101.0   179.0

[5 rows x 15 columns]

Of course, queries can be combined across multiple steps as well.

In [24]: expensive = tbl.query('MSRP > 80000')

In [25]: expensive.query('Horsepower > 400').head()
Out[25]: 
Selected Rows from Table TMPXUAOS9_X

            Make                         Model    Type   ...    Weight Wheelbase  Length
0           Audi                      RS 6 4dr  Sports   ...    4024.0     109.0   191.0
1          Dodge  Viper SRT-10 convertible 2dr  Sports   ...    3410.0      99.0   176.0
2  Mercedes-Benz                     CL600 2dr   Sedan   ...    4473.0     114.0   196.0
3  Mercedes-Benz                  SL55 AMG 2dr  Sports   ...    4235.0     101.0   179.0
4  Mercedes-Benz         SL600 convertible 2dr  Sports   ...    4429.0     101.0   179.0

[5 rows x 15 columns]