23 Random Selection from a Dataset

You can select a random sample from an Rguroo dataset, with or without replacement, and replicate each sample. Moreover, you can apply a statistic to each selected sample using R code.

23.1 Selecting Samples

Instructions to select random samples from a dataset

  1. Use a dataset in your Rguroo account, or recreate the example below by importing the cardata dataset from the Rguroo dataset repository called Rguroo Users Guide into your account.

  2. Open the Probability-Simulation toolbox on the left-hand side of the Rguroo window. Use the Probability dropdown menu and choose Random Selection. The Dataset Random Selection dialog opens (see Figure 23.1).

  3. Select your dataset from the Dataset dropdown.

  4. Select your desired Sample Size, number of samples (Replications), and Seed.

  5. Select either With or Without replacement.

  6. (Optional) If there is a numerical variable that consists of weights (probability of selection) for each case, select the variable from the Probability dropdown.

  • If no variable is selected, all cases have the same probability of being selected.
  • The values of the probability variable must all be non-negative.
  • If the values of the probability variable don’t add up to one, they are internally normalized to add up to one.

  7. (Optional) In the Sample a Subset section, you can specify which rows and columns to sample from.

  • You can select rows using the textboxes From –> To –> By, or by writing R code in Add Rows that results in specific row numbers. You can use both From –> To –> By and Add Rows at the same time.
  • You can select columns by writing R code in Columns that results in specific column numbers.
  • If left blank, all rows and columns will appear in the sample.

In our example, we select from the fourth (MPG) and fifth (HP) columns of the “cardata” dataset, and we only sample “Domestic” cars.

  8. Click the Preview icon to view the result.

  9. (Optional) You can save the result as a stand-alone dataset by typing a name in the Save Dataset As textbox and clicking on the Save Dataset As button.
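
For readers who want to see the idea outside the dialog, the following is a minimal base R sketch of replicated sampling with a seed and optional selection weights. The data frame `cars`, its values, and the weight column `W` are made up for illustration. The sketch shows the general technique and is not a description of how Rguroo implements it internally.

```r
# Hypothetical stand-in for a dataset; these are not the actual cardata values.
cars <- data.frame(
  MPG = c(18, 22, 31, 27, 35, 24, 19, 29),
  HP  = c(150, 120, 90, 105, 70, 110, 160, 95),
  W   = c(1, 1, 2, 2, 4, 1, 1, 2)   # non-negative selection weights
)

set.seed(123)               # Seed: makes the draws reproducible
sample_size      <- 3       # Sample Size
replications     <- 4       # Replications
with_replacement <- FALSE   # With or Without replacement

# Draw row numbers once per replication. sample.int() rescales `prob`
# itself, so the weights do not have to add up to one.
samples <- lapply(seq_len(replications), function(r) {
  rows <- sample.int(nrow(cars), size = sample_size,
                     replace = with_replacement, prob = cars$W)
  data.frame(Replicate = r, cars[rows, c("MPG", "HP")], row.names = NULL)
})

# Stack the replicates into one table, labeled by replicate number.
result <- do.call(rbind, samples)
print(result)
```

Running the code again with the same seed reproduces the same rows, which is the purpose of fixing the seed.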

Figure 23.1: Random Selection Dialog

Figure 23.2: Output of random selection
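
The row and column subsetting described in this section can be sketched in base R as follows. The data frame `cars` and its column layout are hypothetical, chosen so that MPG and HP fall in columns 4 and 5 as in the example. The expressions illustrate the kind of R code that yields row or column numbers; the exact syntax that the Add Rows and Columns fields expect is not shown here and should be checked in the dialog.

```r
# Hypothetical stand-in; TYPE mimics the Domestic/Import variable in cardata.
cars <- data.frame(
  ID   = 1:8,
  MAKE = letters[1:8],
  TYPE = factor(c("Domestic", "Import", "Domestic", "Import",
                  "Domestic", "Domestic", "Import", "Domestic")),
  MPG  = c(18, 22, 31, 27, 35, 24, 19, 29),
  HP   = c(150, 120, 90, 105, 70, 110, 160, 95)
)

# From -> To -> By corresponds to a regular sequence of row numbers.
rows_seq <- seq(from = 1, to = 7, by = 2)          # rows 1, 3, 5, 7

# An R expression can also produce row numbers, e.g. all Domestic cars.
rows_domestic <- which(cars$TYPE == "Domestic")    # rows 1, 3, 5, 6, 8

# Column numbers 4 and 5 are MPG and HP in this made-up layout.
cols <- c(4, 5)

set.seed(1)
# Sample only among the eligible rows, then keep the chosen columns.
picked <- rows_domestic[sample.int(length(rows_domestic), size = 3)]
subset_sample <- cars[picked, cols]
print(subset_sample)
```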

23.2 Applying Statistics to Selected Samples

You can apply functions to your selected random samples by writing R code. In the example below, we write a function that creates a variable called Efficiency. For each sample selected, we compute the mean of MPG, and depending on whether this mean is more than 30, between 20 and 30, or less than 20, the value of Efficiency is set to High, Average, or Low.
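
As a rough illustration of this kind of statistic, here is a stand-alone R sketch that labels one sample by its mean MPG: High above 30, Average from 20 to 30, and Low below 20. The function name `efficiency` and the test vectors are made up. In the dialog the code would presumably refer to the dataset variable directly, so the function wrapper here only serves to make the sketch runnable on its own.

```r
# Classify one sample by its mean MPG.
# Thresholds: mean > 30 is "High", 20 to 30 is "Average", < 20 is "Low".
efficiency <- function(mpg) {
  m <- mean(mpg)
  if (m > 30) {
    "High"
  } else if (m >= 20) {
    "Average"
  } else {
    "Low"
  }
}

efficiency(c(32, 35, 31))   # mean is about 32.7
efficiency(c(22, 28, 25))   # mean is 25
efficiency(c(15, 18, 19))   # mean is about 17.3
```

Whatever the sample size, the function returns exactly one character value per sample.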

Instructions to apply statistics to selected samples

Continue from the instructions in Section 23.1.

  1. Click the Statistic button on the top right of the application. The Custom Statistic dialog opens (see Figure 23.3).

  2. Click the plus icon on the Custom Statistic dialog. In the textbox that appears, type in a variable name.

  3. Type your R code in the middle panel.

  • You can double-click the names of the variables to include them in your code, or type them in.
  • You can write multiple lines of code. However, the result of your code, when applied to each sample (replicate), must be a single number or character.

  4. Click the Preview icon to view the result.

  5. (Optional) You can save the result as a stand-alone dataset by typing a name in the Save Dataset As textbox and clicking on the Save Dataset As button.
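
The requirement that a statistic return a single value per replicate can be sketched in base R with `vapply()`. The replicated data below and the statistic `mpg_per_100hp` are invented for illustration, and the sketch assumes nothing about how Rguroo itself checks the result.

```r
# Hypothetical replicated samples: a Replicate label plus two variables.
samples <- data.frame(
  Replicate = rep(1:3, each = 3),
  MPG = c(32, 35, 31,  22, 28, 25,  15, 18, 19),
  HP  = c(70, 65, 90,  120, 100, 110,  160, 150, 155)
)

# A multi-line statistic is fine as long as its last value is one number.
mpg_per_100hp <- function(d) {
  mean_mpg <- mean(d$MPG)
  mean_hp  <- mean(d$HP)
  100 * mean_mpg / mean_hp
}

# Apply the statistic to each replicate. FUN.VALUE = numeric(1) makes
# vapply() stop with an error if any replicate returns anything other
# than a single number.
by_rep <- split(samples, samples$Replicate)
stat <- vapply(by_rep, mpg_per_100hp, FUN.VALUE = numeric(1))

data.frame(Replicate = names(stat), Statistic = stat, row.names = NULL)
```

A statistic such as `range(d$MPG)`, which returns two numbers, would fail this check.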

Figure 23.3: Random Selection Dialog for Computing Statistics

Figure 23.4: Output of summary statistics

23.3 Stratified Random Sampling

You can select a stratified random sample from an Rguroo dataset, with or without replacement, by providing a stratification variable. The stratification variable must be a factor. Moreover, once a sample is selected, you can apply a statistic to each selected sample using R code, as explained in Section 23.2.

Instructions to select a stratified random sample proportional to the stratum size

  1. Use a dataset in your Rguroo account, or recreate the example below by importing the cardata dataset from the Rguroo dataset repository called Rguroo Users Guide into your account.

  2. Open the Probability-Simulation toolbox on the left-hand side of the Rguroo window. Use the Probability dropdown menu and choose Random Selection. The Dataset Random Selection dialog opens (see Figure 23.5).

  3. Select your dataset from the Dataset dropdown.

  4. Select your desired Sample Size, number of samples (Replications), and Seed.

  5. Select either With or Without replacement.

  6. (Optional) If there is a numerical variable that consists of weights (probability of selection) for each case, select the variable from the Probability dropdown.

  • If no variable is selected, all cases have the same probability of being selected.
  • The values of the probability variable must all be non-negative.
  • If the values of the probability variable don’t add up to one, they are internally normalized to add up to one within each stratum.

  7. In the Stratified Sample section, select the stratification variable from the Stratify by dropdown. The stratification variable must be a factor (categorical).

  8. Choose one of the two stratified sampling options, Equal or Proportional. With Equal, the sample selected from each stratum has the size that you specify. With Proportional, the sample size in each stratum is proportional to the stratum size. Specifically, if \(n\) is the number specified in the Sample Size textbox, \(N\) is the total number of cases in the selected dataset, and \(N_i\) is the number of cases in stratum \(i\), then the number of cases selected from stratum \(i\) is \(\mathrm{round}(n \cdot N_i / N)\).

In this example, we select the variable TYPE as the stratification variable. This variable has two levels, Domestic (35 cases) and Import (47 cases). Since we selected \(n=5\) with the Proportional option, we get 2 cases from Domestic and 3 cases from Import.
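
The proportional allocation for this example can be checked with a few lines of R. The variable names are chosen for illustration; the stratum counts and the rounding rule are the ones stated above.

```r
n   <- 5                                  # number entered as Sample Size
N_i <- c(Domestic = 35, Import = 47)      # cases per stratum in the example
N   <- sum(N_i)                           # 82 cases in total

# Unrounded shares are 5 * 35 / 82 = 2.13 and 5 * 47 / 82 = 2.87.
n_i <- round(n * N_i / N)
n_i
# Domestic   Import
#        2        3
```

Because each stratum is rounded separately under this formula, the stratum sizes are not guaranteed to add up to \(n\) for other inputs. For instance, three strata of equal size with \(n=5\) would give 2 cases each, or 6 in total.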

  9. When stratified sampling is used, row subsets cannot be selected. However, selecting a subset of columns is possible. In our example, we select the fourth (MPG) and fifth (HP) columns of the “cardata” dataset.

  10. Click the Preview icon to view the result.

  11. (Optional) You can save the result as a stand-alone dataset by typing a name in the Save Dataset As textbox and clicking on the Save Dataset As button.
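
The two allocation options can be sketched in base R as follows. The helper `stratified_sample()`, the data frame `cars`, and its values are hypothetical and only illustrate the general technique, not Rguroo's implementation. For brevity the sketch samples without replacement and ignores selection weights.

```r
# Hypothetical stand-in with a factor stratification variable
# (4 Domestic and 6 Import cases).
cars <- data.frame(
  TYPE = factor(rep(c("Domestic", "Import"), times = c(4, 6))),
  MPG  = c(18, 22, 31, 27, 35, 24, 19, 29, 33, 26),
  HP   = c(150, 120, 90, 105, 70, 110, 160, 95, 80, 115)
)

stratified_sample <- function(data, strata, n,
                              allocation = c("Equal", "Proportional")) {
  allocation <- match.arg(allocation)
  # Row numbers belonging to each stratum.
  groups <- split(seq_len(nrow(data)), data[[strata]])
  # Equal: n cases per stratum. Proportional: round(n * N_i / N).
  sizes <- if (allocation == "Equal") {
    rep(n, length(groups))
  } else {
    round(n * lengths(groups) / nrow(data))
  }
  # Sample within each stratum, then collect the chosen row numbers.
  picked <- unlist(Map(function(rows, k) {
    rows[sample.int(length(rows), size = k)]
  }, groups, sizes), use.names = FALSE)
  data[picked, , drop = FALSE]
}

set.seed(42)
equal_sample <- stratified_sample(cars, "TYPE", n = 2, allocation = "Equal")
prop_sample  <- stratified_sample(cars, "TYPE", n = 5, allocation = "Proportional")

table(equal_sample$TYPE)   # 2 cases from each stratum
table(prop_sample$TYPE)    # 2 Domestic and 3 Import cases
```

Indexing with `rows[sample.int(length(rows), ...)]` instead of `sample(rows, ...)` avoids a known pitfall of `sample()`, which treats a single number as the range `1:x`.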

Figure 23.5: Random Selection Dialog for stratified random sampling

Figure 23.6: Output of random selection