The archival appraisal of records containing personal information: A RAMP study with guidelines
 4. Appraisal methodologies, criteria, and options

### Sampling: a summary profile

30. It is not the purpose of this study to investigate the various sampling methodologies in detail nor to review actual relevant sampling cases from archival practice. A RAMP study has been published on sampling and readers should consult it for more details and particular examples.16 The aim here is merely to give a brief summary of sampling as an appraisal option for acquiring records containing personal information. Some will argue that only random or statistical sampling is true sampling and that the other means cited below are better termed selection. However, unlike the example, they all attempt to represent some or all of the characteristics of the whole (or of some feature of the whole) in the part chosen, and for this reason are here termed "sampling." While there are many sampling methods, most usually fit into one (or a combination) of the following four categories.17

31. Statistical Sampling. Selection based on mathematical techniques that determine the proper number of cases (i.e., size of the sample) and the actual means of selecting specific cases necessary to preserve a "representative" (statistically valid) sample of the entire series. This is sometimes called probability sampling.

Example: Selection based on random number tables, or an automated random number generator, and then pulling the required files matching the randomly identified numbers. There are three types of statistical or random sampling: simple random (where the random numbers are applied blindly to the entire population, which sometimes means small pockets of files of a particular type may be missed entirely); systematic random (where the first number is chosen randomly, and then every nth number thereafter is chosen, which is particularly helpful for chronologically organized series and for avoiding the "missing pockets" syndrome); and stratified random (where the whole is broken down into logical parts - like the categories in the United States Justice litigation case files cited above - and then each part or office is randomly sampled, thus ensuring that no part is overlooked).18
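The three types of random sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure prescribed by this study: it assumes the files carry sequential numbers held in a list, and the function names, the fixed seed, and the 5 per cent fraction used in the test are hypothetical choices.

```python
import random

def simple_random(file_ids, n, seed=0):
    """Simple random: draw n file numbers blindly from the whole series."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    return sorted(rng.sample(file_ids, n))

def systematic_random(file_ids, step, seed=0):
    """Systematic random: random starting point, then every nth file."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random offset within the first interval
    return file_ids[start::step]

def stratified_random(strata, fraction, seed=0):
    """Stratified random: sample each part (stratum) of the series separately."""
    rng = random.Random(seed)
    sample = {}
    for name, ids in strata.items():
        # Take at least one file from every non-empty stratum,
        # so that no part of the series is overlooked.
        k = min(len(ids), max(1, round(len(ids) * fraction)))
        sample[name] = sorted(rng.sample(ids, k))
    return sample
```

With a series of 1,000 numbered files, `systematic_random(list(range(1, 1001)), 20)` returns 50 files spaced evenly through the series, whereas `simple_random` may leave long runs of files untouched.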

Advantages: The sample can be used to reconstruct the whole and the results should be statistically valid. It is theoretically unbiased and thus easily explainable to researchers. For a numerical arrangement of files, it may be a relatively easy sample to pull by clerical staff. Finally, archivists can control the size of the sample, and normally it will be quite manageable, since even for large series, the proper statistical weight can be assigned, even when a relatively small sample is chosen (about a maximum of 1,500 total cases is NARA's experience out of any size series, whether from ten thousand or ten million cases).
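Why the sample size levels off regardless of the size of the series can be illustrated with the standard sample-size formula for estimating a proportion (Cochran's formula with a finite population correction). The study does not state the confidence level or margin of error behind the figure of about 1,500 cases; the 95 per cent confidence level (z = 1.96) and the margin of plus or minus 2.5 per cent used below are assumptions chosen only because they produce a figure of that order.

```python
import math

def sample_size(population, margin=0.025, z=1.96, p=0.5):
    """Files needed to estimate a proportion within +/- margin.

    margin, z and p are illustrative assumptions, not values from the study.
    p = 0.5 is the most conservative (largest-sample) choice.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an unlimited series
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return math.ceil(n)

for total in (10_000, 10_000_000):
    print(total, sample_size(total))
```

Under these assumptions the required sample is 1,333 files for a series of ten thousand and 1,537 for a series of ten million: a thousandfold increase in the series adds only about two hundred files to the sample.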

Disadvantages: There is obviously little chance that the few exceptional or outstanding cases in the series will be included in the random or statistical sample, although this can be compensated for by using a second method (see below) to complement the random sample. As well, researchers cannot do longitudinal work; one cannot trace a county or individual over time, as the county or person in every likelihood will not be selected for every annual or decennial random sample from the series. For files arranged alphabetically or in some other non-numerical scheme, the statistical sample is very difficult to pull, as it will require the counting and often may require the costly numbering of all the files before pulling. And for complex file series, there may be the need for a stratified (i.e., multiple) sample to ensure that various types of actions are sampled; this is very expensive and requires great statistical expertise. A high level of analysis is also required to determine the homogeneity of the series and the nature of the features or characteristics within the files which must each be given statistical weight. Archivists naturally should not be afraid of complex analysis nor of acquiring new expertise, but only cautious that the time thus spent to determine these factors does not pass the point of diminishing returns. As well, in that the total information universe is rarely known to the archivist for large series of continuing files perhaps scattered in hundreds of field offices, it is somewhat difficult and always expensive to apply statistical sampling techniques: one can, of course, sample each office separately, or add up the total number of files in each office in order to determine the whole before beginning the random sample. More difficult to determine (and defend) is when to sample: every year (and on what date in it?), every tenth year, every twenty-fifth, etc.
Finally, and most problematic, for continuing series organized without logical cutoff points, which is the case for the majority of operational programmes, the open-ended nature of the records system means that the information universe is unknowable. The first files in the series will be ready for destruction or preservation long before the last file is even begun. An unknowable information universe renders statistically valid sampling impossible; only for closed or contained series is it relevant. Of course, the archivist may impose cut-off dates (in order to "close" an "open" series), or do statistical samples at certain time intervals on all closed volumes of files accumulated to that point. But with such tactics, the size of the total sample remains uncontrollable, and for the whole series (when it eventually stops) there is no assurance that the sum of some twenty or fifty samples taken on parts of the series over the years is equivalent to (and as statistically valid as) the one hypothetical (but impossible to perform) sample of the entire series.

32. Systematic Sampling. Selection based on a physical characteristic of the records or filing scheme without regard to the substantive information in the selected files.

Examples: All files from years ending in "2" or for surnames beginning with "F"; every twentieth or nth file; every social insurance or identity number ending in "5"; all files measuring more than one inch thick or more than x volumes or sections in format (the so-called "fat file" method), which of course will vary from series to series as to what is "fat."
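Each of the rules in these examples tests only a physical or filing characteristic of the file, never its content, which is why clerical staff can apply them. A minimal sketch follows, assuming a hypothetical file register with the fields shown; the `CaseFile` record, its field names, and the one-inch default are illustrative and do not come from any actual records system.

```python
from dataclasses import dataclass

@dataclass
class CaseFile:
    # Hypothetical register entry; field names are assumptions for illustration.
    file_no: int
    year: int
    surname: str
    id_number: str
    thickness_in: float  # physical thickness of the file, in inches

def year_ending_in(files, digit):
    """All files from years ending in the given digit (e.g. 2)."""
    return [f for f in files if f.year % 10 == digit]

def surname_starting_with(files, letter):
    """All files for surnames beginning with the given letter (e.g. "F")."""
    return [f for f in files if f.surname.upper().startswith(letter.upper())]

def id_ending_in(files, digit):
    """All files whose identity number ends in the given digit (e.g. "5")."""
    return [f for f in files if f.id_number.endswith(digit)]

def every_nth(files, n, start=0):
    """Every nth file in filing order, from a fixed starting position."""
    return files[start::n]

def fat_files(files, min_thickness_in=1.0):
    """The "fat file" rule: files thicker than a threshold set per series."""
    return [f for f in files if f.thickness_in > min_thickness_in]
```

Note that the number of files each rule returns depends entirely on the series, so the size of the resulting sample cannot be fixed in advance.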

Advantages: The sample is relatively easy to pull, and does not require great expertise in the substantive content covered by the file. It can thus be pulled relatively inexpensively by clerks in records offices or records centres, rather than needing the direct (and costly) intervention of archivists or senior programme officials in departments with their knowledge of the substance of the files. As noted earlier (see 4.20 above), if using the "fat file" method, the chances are good that the archivist may get most of the real problem cases.

Disadvantages: This method is not statistically valid; it cannot be used to reconstruct the whole. It is difficult to explain to researchers (i.e., to justify saving years ending in "2" rather than "7," and so on). It is impossible to control the size of the sample (especially with the "fat file" method) and thus space planning is very difficult. And, quite evidently, this method (with the partial exception of the "fat file" approach) does not guarantee that the outstanding or controversial cases have been preserved. Conversely, the fat file approach will likely result in preserving the cases which were the exception, not the rule.

33. Exemplary Sampling. Selection made on a qualitative basis to document some "typical" characteristic, activity, or time period.

Examples: All files from a particular region to show how a "typical" field office operated; or all files from the years immediately before and after an agency reorganization or significant legislative change to show their impact on actual operations; or all files for particular types of court proceedings (e.g., felony convictions); or all files for public servants reaching the rank of director or above. As another example, the archivist could also keep all series for many agencies for a very intensive geographical area (a small region or city) which is typical of the whole nation in order to take a snapshot of the societal image. The "fat file" may also be an exemplary sample, even though its physical characteristic places it first under the systematic sampling category (see 4.32 above). If the file is "fat" because some consistent characteristic or feature of the programme renders it so (rather than just its physical size), then if that feature or characteristic is the one the archivist deems worthy of preservation for qualitative rather than purely quantitative reasons, the fat file method is also an exemplary sample.

Advantages: The method can be justified to researchers, although with some difficulty, and it can be used to trace a programme over time.

Disadvantages: The method is not statistically valid and cannot be used to reconstruct the whole. It does not save the exceptional cases and again there is no control over the size of the sample. It does require substantive expertise to make the right choices, as "typicality" of the isolated feature or characteristic or the time period will always be open to dispute, and therefore will require the archivist's careful analysis and explanation.

34. Exceptional Sampling. Selection of files on significant individuals, precedent-setting programmes, and landmark cases.

Examples: Pulling of exceptional individual cases from a series follows different criteria in each case. However, there are several types of individuals to watch for and these were stated generally in section 4.19 above, and a particular example was given for personnel files in section 2.27 of how to isolate the famous, controversial, and "firsts" from the ordinary and routine cases. Once again, depending on the reasons for unusually large files, the "fat file" method for some series may also indicate exceptional, precedent-setting cases.

Advantages: This selection can be justified, although with some difficulty, as it usually saves the controversial files that are often demanded by researchers.

Disadvantages: The method is obviously not statistically valid, and may give a false impression of what the original whole series was like (i.e., distort the view of a "typical" case). It requires great substantive expertise as well as relatively good prior identification and arrangement of files so that the exceptional cases can be located and pulled. It is very closely linked to current research trends, and therefore highly susceptible to bias. Again, the size of the sample cannot be controlled.

35. Stratified sampling uses the same sampling method to acquire two or more samples from the same series in order to protect different characteristics of the whole: lower courts and appeal courts; field and regional offices; different income levels; or whatever other strata into which it seems useful to divide the files.

36. The archivist can also combine two or more of the above methods, where appropriate. If random or statistical sampling is one of the methods, it must of course be applied first so that the statistical validity of the whole is not impaired. It may be desirable to use statistical or systematic sampling first, and then search for an exemplary or exceptional sample second.
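The order of operations when combining methods can be sketched as follows. The random sample is drawn first from the entire series, and the exceptional files are then sought among the remainder and kept as a separately identified group, so that the statistical sample is left intact. The function name, the `is_exceptional` test, and the fixed seed are hypothetical; in practice, identifying exceptional cases depends on the archivist's judgement rather than a simple rule.

```python
import random

def combined_sample(file_ids, n_random, is_exceptional, seed=0):
    """Draw the statistical sample first, then add exceptional files.

    is_exceptional is a caller-supplied test (an assumption for illustration).
    Returns two separate lists so the random sample stays identifiable.
    """
    rng = random.Random(seed)
    # Step 1: random sample from the whole, untouched series.
    statistical = set(rng.sample(file_ids, n_random))
    # Step 2: exceptional files not already caught by the random draw.
    exceptional = {f for f in file_ids if is_exceptional(f)} - statistical
    return sorted(statistical), sorted(exceptional)
```

Keeping the two groups distinct lets a researcher still treat the first as statistically valid, while the second documents the landmark cases the random draw would most likely have missed.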