Handling By Groups

If By groups are specified when running a CAS action, the result are returned with the following behaviors.

  1. A result key named ‘ByGroupInfo’ is returned with all of the By group variable values.
  2. Each By group table is returned in a separate result key with a prefix of ‘ByGroup#.’.

These behaviors can help when you have a large number of By groups and you want to process them as they arrive at the client rather than trying to hold the entire set of results in memory. However, when your result sets are smaller, you may want to combine all of the By group tables into a single pandas.DataFrame. To help in these situations, the CASResults class defines some helper methods for you.

Here is what it looks like to run a standard summary action, and a summary action with a By group specified. We will use this output to demonstrate the By group processing methods.

In [1]: tbl = tbl[['MSRP', 'Horsepower']]

In [2]: tbl.summary(subset=['Min', 'Max'])
Out[2]: 
[Summary]

 Descriptive Statistics for TMP_X94FHXF
 
        Column      Min       Max
 0        MSRP  10280.0  192465.0
 1  Horsepower     73.0     500.0

+ Elapsed: 0.0159s, user: 0.012s, sys: 0.007s, mem: 5.41mb

In [3]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max'])
Out[3]: 
[ByGroupInfo]

 ByGroupInfo
 
     Origin Origin_f  Cylinders   Cylinders_f               _key_
 0     Asia     Asia        NaN             .  Asia             .
 1     Asia     Asia        3.0             3  Asia             3
 2     Asia     Asia        4.0             4  Asia             4
 3     Asia     Asia        6.0             6  Asia             6
 4     Asia     Asia        8.0             8  Asia             8
 5   Europe   Europe        4.0             4  Europe           4
 6   Europe   Europe        5.0             5  Europe           5
 7   Europe   Europe        6.0             6  Europe           6
 8   Europe   Europe        8.0             8  Europe           8
 9   Europe   Europe       12.0            12  Europe          12
 10     USA      USA        4.0             4  USA              4
 11     USA      USA        6.0             6  USA              6
 12     USA      USA        8.0             8  USA              8
 13     USA      USA       10.0            10  USA             10

[ByGroup1.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   .                MSRP  25700.0  27200.0
        .          Horsepower    197.0    238.0

[ByGroup2.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   3                MSRP  19110.0  19110.0
        3          Horsepower     73.0     73.0

[ByGroup3.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   4                MSRP  10280.0  33260.0
        4          Horsepower     93.0    300.0

[ByGroup4.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   6                MSRP  16495.0  89765.0
        6          Horsepower    155.0    290.0

[ByGroup5.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   8                MSRP  26650.0  64800.0
        8          Horsepower    235.0    340.0

[ByGroup6.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Europe 4                MSRP  16999.0  43175.0
        4          Horsepower    100.0    250.0

[ByGroup7.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Europe 5                MSRP  31745.0  42565.0
        5          Horsepower    194.0    300.0

[ByGroup8.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 6                MSRP  23785.0  192465.0
        6          Horsepower    168.0     477.0

[ByGroup9.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 8                MSRP  39235.0  121770.0
        8          Horsepower    217.0     493.0

[ByGroup10.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 12               MSRP  75000.0  128420.0
        12         Horsepower    420.0     493.0

[ByGroup11.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    4                MSRP  10995.0  25955.0
        4          Horsepower    103.0    220.0

[ByGroup12.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    6                MSRP  17630.0  40720.0
        6          Horsepower    155.0    275.0

[ByGroup13.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    8                MSRP  22010.0  76200.0
        8          Horsepower    224.0    350.0

[ByGroup14.Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    10               MSRP  41475.0  81795.0
        10         Horsepower    310.0    500.0

+ Elapsed: 0.0232s, user: 0.02s, sys: 0.008s, mem: 6.58mb

Selecting Tables by Name Across By Groups

The CASResults.get_tables() method will return all tables with a given name across By groups in a list. If the results do not contain By groups, the single table with that name will be returned in a list. This makes it possible to use CASResults.get_tables() the same way whether By group variables are specified or not.

In [4]: tbl.summary(subset=['Min', 'Max']).get_tables('Summary')
Out[4]: 
[Descriptive Statistics for TMP_X94FHXF
 
        Column      Min       Max
 0        MSRP  10280.0  192465.0
 1  Horsepower     73.0     500.0]

In [5]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_tables('Summary')
Out[5]: 
[Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   .                MSRP  25700.0  27200.0
        .          Horsepower    197.0    238.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   3                MSRP  19110.0  19110.0
        3          Horsepower     73.0     73.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   4                MSRP  10280.0  33260.0
        4          Horsepower     93.0    300.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   6                MSRP  16495.0  89765.0
        6          Horsepower    155.0    290.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   8                MSRP  26650.0  64800.0
        8          Horsepower    235.0    340.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Europe 4                MSRP  16999.0  43175.0
        4          Horsepower    100.0    250.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Europe 5                MSRP  31745.0  42565.0
        5          Horsepower    194.0    300.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 6                MSRP  23785.0  192465.0
        6          Horsepower    168.0     477.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 8                MSRP  39235.0  121770.0
        8          Horsepower    217.0     493.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min       Max
 Origin Cylinders                               
 Europe 12               MSRP  75000.0  128420.0
        12         Horsepower    420.0     493.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    4                MSRP  10995.0  25955.0
        4          Horsepower    103.0    220.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    6                MSRP  17630.0  40720.0
        6          Horsepower    155.0    275.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    8                MSRP  22010.0  76200.0
        8          Horsepower    224.0    350.0,
 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 USA    10               MSRP  41475.0  81795.0
        10         Horsepower    310.0    500.0]

The reason that the table name is required is that many CAS actions can have multiple tables with different names in each By group.

Concatenating By Group Tables

While you can use the CASResults.get_tables() method to retrieve tables of a specified name then use the concat() function to concatenate them together, there is also a CASResults.concat_bygroups() method that you can use. This method will concatenate all tables with the same name across all By groups and set the concatenated table under the table’s key.

In [6]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).concat_bygroups()
Out[6]: 
[Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min       Max
 Origin Cylinders                               
 Asia   .                MSRP  25700.0   27200.0
        .          Horsepower    197.0     238.0
        3                MSRP  19110.0   19110.0
        3          Horsepower     73.0      73.0
        4                MSRP  10280.0   33260.0
        4          Horsepower     93.0     300.0
        6                MSRP  16495.0   89765.0
        6          Horsepower    155.0     290.0
        8                MSRP  26650.0   64800.0
        8          Horsepower    235.0     340.0
 Europe 4                MSRP  16999.0   43175.0
        4          Horsepower    100.0     250.0
        5                MSRP  31745.0   42565.0
        5          Horsepower    194.0     300.0
        6                MSRP  23785.0  192465.0
        6          Horsepower    168.0     477.0
        8                MSRP  39235.0  121770.0
        8          Horsepower    217.0     493.0
        12               MSRP  75000.0  128420.0
        12         Horsepower    420.0     493.0
 USA    4                MSRP  10995.0   25955.0
        4          Horsepower    103.0     220.0
        6                MSRP  17630.0   40720.0
        6          Horsepower    155.0     275.0
        8                MSRP  22010.0   76200.0
        8          Horsepower    224.0     350.0
        10               MSRP  41475.0   81795.0
        10         Horsepower    310.0     500.0

By default, this method returns a new CASResults object with the concatenated tables. If you want to modify the CASResults object in place, you can add a inplace=True option.

Selecting a Specific By Group

In addition to selecting tables by name, you can also select a specific By group in the result by specifying the By variable values. This is done with the CASResults.get_group() method.

In [7]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_group(['Asia', 4])
Out[7]: 
[Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   4                MSRP  10280.0  33260.0
        4          Horsepower     93.0    300.0

The values given for the By group variable values can be either the raw value or the formatted value.

You can also specify the grouping variables as keyword arguments.

In [8]: tbl.groupby(['Origin', 'Cylinders']).summary(subset=['Min', 'Max']).get_group(Origin='Asia', Cylinders=4)
Out[8]: 
[Summary]

 Descriptive Statistics for TMP_X94FHXF
 
                       Column      Min      Max
 Origin Cylinders                              
 Asia   4                MSRP  10280.0  33260.0
        4          Horsepower     93.0    300.0

Multiple Sets of By Groups

Some CAS actions like simple.mdsummary allow you to specify multiple By group sets. In cases like this, the keys for each By group set are prefixed with “ByGroupSet#.”. To select a By group set, you can use the CASResults.get_set() method. This takes a numeric index indicating which By group set to select. The return value is a new CASResults object that contains just the selected By group set. You can then use the methods above to select tables or concatenate tables together.

In [9]: tbl.mdsummary(sets=[dict(groupby=["Origin"]),
   ...:                     dict(groupby=["Cylinders"])])
   ...: 
Out[9]: 
[ByGroupSet1.ByGroupInfo]

 ByGroupSet1.ByGroupInfo
 
    Origin Origin_f   _key_
 0    Asia     Asia    Asia
 1  Europe   Europe  Europe
 2     USA      USA     USA

[ByGroupSet1.ByGroup1.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
             Column      Min      Max      ...              CV     TValue         ProbT
 Origin                                    ...                                         
 Asia          MSRP  10280.0  89765.0      ...       45.757738  27.470338  7.525437e-62
 Asia    Horsepower     73.0    340.0      ...       31.144121  40.360121  8.503586e-85
 
 [2 rows x 15 columns]

[ByGroupSet1.ByGroup2.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
             Column      Min       Max      ...              CV     TValue         ProbT
 Origin                                     ...                                         
 Europe        MSRP  16999.0  192465.0      ...       52.365474  21.179101  1.104072e-42
 Europe  Horsepower    100.0     493.0      ...       32.052683  34.600962  6.412879e-65
 
 [2 rows x 15 columns]

[ByGroupSet1.ByGroup3.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
             Column      Min      Max      ...              CV     TValue         ProbT
 Origin                                    ...                                         
 USA           MSRP  10995.0  81795.0      ...       41.272157  29.376598  3.686853e-63
 USA     Horsepower    103.0    500.0      ...       29.953802  40.476851  2.996113e-81
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroupInfo]

 ByGroupSet2.ByGroupInfo
 
    Cylinders   Cylinders_f         _key_
 0        NaN             .             .
 1        3.0             3             3
 2        4.0             4             4
 3        5.0             5             5
 4        6.0             6             6
 5        8.0             8             8
 6       10.0            10            10
 7       12.0            12            12

[ByGroupSet2.ByGroup1.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min      Max    N    ...           CSS         CV     TValue     ProbT
 Cylinders                                       ...                                              
 .                MSRP  25700.0  27200.0  2.0    ...     1125000.0   4.010057  35.266667  0.018047
 .          Horsepower    197.0    238.0  2.0    ...         840.5  13.329369  10.609756  0.059827
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroup2.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min      Max    N  NMiss  ...            USS  CSS  CV  TValue  ProbT
 Cylinders                                            ...                                       
 3                MSRP  19110.0  19110.0  1.0    0.0  ...    365192100.0  0.0 NaN     NaN    NaN
 3          Horsepower     73.0     73.0  1.0    0.0  ...         5329.0  0.0 NaN     NaN    NaN
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroup3.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min      Max      ...              CV     TValue         ProbT
 Cylinders                                    ...                                         
 4                MSRP  10280.0  43175.0      ...       35.806222  32.569489  8.276602e-66
 4          Horsepower     93.0    300.0      ...       24.175595  48.238332  5.424646e-87
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroup4.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min      Max      ...              CV     TValue         ProbT
 Cylinders                                    ...                                         
 5                MSRP  31745.0  42565.0      ...        9.833327  26.905963  1.741013e-07
 5          Horsepower    194.0    300.0      ...       16.660895  15.880007  3.957170e-06
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroup5.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min       Max      ...               CV     TValue          ProbT
 Cylinders                                     ...                                           
 6                MSRP  16495.0  192465.0      ...        50.094946  27.515847   5.123923e-68
 6          Horsepower    155.0     477.0      ...        18.924015  72.838924  2.971778e-140
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroup6.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min       Max      ...              CV     TValue         ProbT
 Cylinders                                     ...                                         
 8                MSRP  22010.0  121770.0      ...       36.610383  25.477414  7.769015e-42
 8          Horsepower    217.0     493.0      ...       16.186567  57.624195  1.709083e-70
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroup7.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min      Max    N    ...             CSS         CV    TValue     ProbT
 Cylinders                                       ...                                               
 10               MSRP  41475.0  81795.0  2.0    ...     812851200.0  46.257071  3.057292  0.201247
 10         Horsepower    310.0    500.0  2.0    ...         18050.0  33.172911  4.263158  0.146679
 
 [2 rows x 15 columns]

[ByGroupSet2.ByGroup8.MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
                Column      Min       Max    ...            CV     TValue     ProbT
 Cylinders                                   ...                                   
 12               MSRP  75000.0  128420.0    ...     27.582919   6.279433  0.024435
 12         Horsepower    420.0     493.0    ...      8.992867  19.260274  0.002685
 
 [2 rows x 15 columns]

+ Elapsed: 0.0673s, user: 0.054s, sys: 0.032s, mem: 36.4mb

In [10]: tbl.mdsummary(sets=[dict(groupby=["Origin"]),
   ....:                     dict(groupby=["Cylinders"])]).get_set(1).get_group('Asia')
   ....: 
Out[10]: 
[MDSummary]

 Descriptive Statistics for TMP_X94FHXF
 
             Column      Min      Max      ...              CV     TValue         ProbT
 Origin                                    ...                                         
 Asia          MSRP  10280.0  89765.0      ...       45.757738  27.470338  7.525437e-62
 Asia    Horsepower     73.0    340.0      ...       31.144121  40.360121  8.503586e-85
 
 [2 rows x 15 columns]