1 Introduction

Hervé Cardot, Alain Dessertaine, Camelia Goga, Étienne Josserand and Pauline Lardin


With the development of automated data acquisition processes at fine time scales, it is no longer unusual to have very large databases on phenomena that change over time. For example, in the coming years in France, approximately 30 million electric meters will be replaced by smart meters. These will be able to measure the consumption of each household and business at potentially very fine time scales (by the second or minute) and send the measurements once a day to a central server. Another example is measuring the viewership of different television channels: boxes record, in continuous time, whether the television is on and which channel is being viewed.

The statistical unit studied is accordingly a function (of time or space), which calls for the introduction of functional analysis tools. Although this branch of statistics has existed since the 1970s (Deville 1974; Dauxois and Pousse 1976), it truly developed during the 1990s with advances in computer technology. It has applications in various fields such as climatology, economics, remote sensing, medicine and quantitative chemistry. Readers may consult Ramsay and Silverman (2005) and Ferraty and Romain (2011) for an overview of the different techniques and examples of applications.

When the potential databases are very large, it can be difficult and costly to collect, save and analyze the entire data set. Moreover, if one is interested in simple indicators such as the mean curve under constraints of memory space or the cost of transmission, the use of survey techniques to extract a sample can provide a precise estimate at a reasonable cost (Dessertaine 2008).

In the statistical literature, there are as yet few studies that combine functional data analysis and sampling theory. Cardot, Chaouch, Goga and Labruère (2010) are interested in using principal component analysis to reduce the dimension of the data, while Cardot and Josserand (2011) examine the uniform convergence properties of Horvitz-Thompson estimators of mean curves. Chaouch and Goga (2012) provide a robust estimator of central curves.

The objective of this study is to compare different sampling strategies in a functional context, using a real example. These real data concern the electricity consumption, measured every half hour for two weeks, of a test population of N = 15,069 electric meters. The time profile of individuals' electricity use depends on covariates such as weather conditions (temperature, etc.) or geographic characteristics (altitude, latitude or longitude). Unfortunately, those variables are not available for this study, and we use only one variable as auxiliary information: the mean consumption from each meter during the previous week. This information can easily be transmitted by all the meters in the population.

Extending estimation methods that use auxiliary information to the functional framework is not always straightforward. Cardot and Josserand (2011) propose stratifying the population of curves to improve the estimate of the mean curve. Chaouch and Goga (2012), who are interested in the median curve, suggest using PPS (probability proportional to size) sampling with replacement as well as a post-stratified estimator. In this article, we propose to compare several strategies that take auxiliary information into account. The first strategy uses auxiliary information in the selection of the sample: sampling with an unequal probability design (stratified, πps) and estimation with the Horvitz-Thompson estimator. The second strategy introduces this information at the estimation stage: a simple random sample is drawn without replacement and estimation is performed using a linear regression model (Särndal, Swensson and Wretman 1992) adapted to the functional framework (Faraway 1997).
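To make the two strategies concrete, the following minimal sketch computes a Horvitz-Thompson mean curve from a stratified sample and a regression-type mean curve from a simple random sample. It is an illustration under stated assumptions, not the implementation used in the study: the toy population, the two strata, the allocation (10 and 30 units) and the function names `ht_mean_curve` and `regression_mean_curve` are all hypothetical, and the regression estimator is the textbook one-covariate form applied separately at each time point.

```python
import random

def ht_mean_curve(sample_curves, incl_probs, pop_size):
    """Horvitz-Thompson estimate: (1/N) * sum over the sample of Y_k(t) / pi_k."""
    n_times = len(sample_curves[0])
    total = [0.0] * n_times
    for curve, pi_k in zip(sample_curves, incl_probs):
        for j, y in enumerate(curve):
            total[j] += y / pi_k
    return [v / pop_size for v in total]

def regression_mean_curve(sample_curves, sample_x, pop_mean_x):
    """Regression estimate under simple random sampling: at each time point,
    ybar(t) + b(t) * (population mean of x - sample mean of x)."""
    n = len(sample_curves)
    x_bar = sum(sample_x) / n
    sxx = sum((x - x_bar) ** 2 for x in sample_x)
    est = []
    for j in range(len(sample_curves[0])):
        y_bar = sum(c[j] for c in sample_curves) / n
        sxy = sum((x - x_bar) * (c[j] - y_bar)
                  for x, c in zip(sample_x, sample_curves))
        est.append(y_bar + (sxy / sxx) * (pop_mean_x - x_bar))
    return est

# Hypothetical toy population: N curves on T time points, with an auxiliary
# variable x standing in for the previous week's mean consumption.
rng = random.Random(0)
N, T = 200, 5
levels = [rng.uniform(1.0, 10.0) for _ in range(N)]
population = [[x * (1 + 0.1 * t) + rng.gauss(0, 0.2) for t in range(T)]
              for x in levels]
mu_true = [sum(c[j] for c in population) / N for j in range(T)]

# Strategy 1: stratify on x, sample without replacement in each stratum,
# then weight each sampled curve by 1 / pi_k with pi_k = n_h / N_h.
order = sorted(range(N), key=lambda k: levels[k])
strata, sizes = [order[:100], order[100:]], [10, 30]
idx, probs = [], []
for units, n_h in zip(strata, sizes):
    idx.extend(rng.sample(units, n_h))
    probs.extend([n_h / len(units)] * n_h)
mu_ht = ht_mean_curve([population[k] for k in idx], probs, N)

# Strategy 2: simple random sample of the same total size, with the
# auxiliary information used only at the estimation stage.
srs = rng.sample(range(N), 40)
mu_reg = regression_mean_curve([population[k] for k in srs],
                               [levels[k] for k in srs],
                               sum(levels) / N)
```

Comparing `mu_ht` and `mu_reg` with `mu_true` over many repeated draws would give a rough sense of the relative precision of the two strategies on this toy population.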

A new question, related to the functional nature of the data, naturally arises: how to quantify sampling uncertainty? The construction of confidence intervals, a central concern for survey methodologists, has received little attention in the field of functional data statistics, where it is a matter of constructing confidence bands. Drawing on techniques based on estimation of the covariance function of the estimator (see Faraway (1997), Cuevas, Febrero and Fraiman (2006) or, more recently, Degras (2011)), we first propose to construct confidence bands by simulating Gaussian processes. An asymptotic justification of the validity of these techniques is given in Cardot, Degras and Josserand (2013) when the hypotheses of the central limit theorem are satisfied and a precise estimator of the covariance function is available. A second method of construction, which is based on bootstrap techniques, is also applied. There are essentially three bootstrap techniques for use in a finite population: the bootstrap without replacement proposed by Gross (1980), the rescaling bootstrap (Rao and Wu 1988) and the mirror-match bootstrap (Sitter 1992). In this study, we use the bootstrap without replacement, relying on the adaptations for the stratified and PPS designs proposed by Chauvet (2007).
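The idea of building a confidence band by simulating Gaussian processes can be sketched as follows. This is a minimal illustration, not the algorithm of the cited papers: it assumes simple random sampling without replacement, for which the covariance function of the mean estimator is estimated by (1 - n/N)/n times the sample covariance function, and the function name `srs_confidence_band` and the toy data are hypothetical. The band is the estimated mean curve plus or minus c times the pointwise standard error, where c is the simulated (1 - alpha) quantile of the supremum of the absolute standardized Gaussian process.

```python
import math
import random

def srs_confidence_band(sample, pop_size, alpha=0.05, n_sim=1000, rng=None):
    """Mean curve and simultaneous confidence band under simple random
    sampling without replacement (the design assumed in this sketch)."""
    rng = rng or random.Random()
    n, n_times = len(sample), len(sample[0])
    mean = [sum(c[j] for c in sample) / n for j in range(n_times)]
    centered = [[c[j] - mean[j] for j in range(n_times)] for c in sample]
    # Pointwise sample standard deviation; assumed > 0 at every time point.
    sd = [math.sqrt(sum(c[j] ** 2 for c in centered) / (n - 1))
          for j in range(n_times)]
    # Estimated standard error of the mean: sqrt((1 - n/N) / n) * sd(t).
    se = [math.sqrt((1 - n / pop_size) / n) * s for s in sd]
    sups = []
    for _ in range(n_sim):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # sum_k g_k * centered_k(t) / sqrt(n - 1) is Gaussian with the sample
        # covariance function; dividing by sd(t) gives unit variance at each t.
        z = [sum(g[k] * centered[k][j] for k in range(n))
             / (math.sqrt(n - 1) * sd[j]) for j in range(n_times)]
        sups.append(max(abs(v) for v in z))
    sups.sort()
    # Empirical (1 - alpha) quantile of the simulated suprema.
    c_alpha = sups[min(n_sim - 1, math.ceil((1 - alpha) * n_sim) - 1)]
    lower = [m - c_alpha * s for m, s in zip(mean, se)]
    upper = [m + c_alpha * s for m, s in zip(mean, se)]
    return mean, lower, upper, c_alpha

# Hypothetical toy sample: 50 noisy curves on 6 time points, drawn from a
# population assumed to contain 500 units.
rng = random.Random(1)
toy_levels = [rng.uniform(1.0, 10.0) for _ in range(50)]
toy_sample = [[x * (1 + 0.1 * t) + rng.gauss(0, 0.3) for t in range(6)]
              for x in toy_levels]
mean, lower, upper, c_alpha = srs_confidence_band(toy_sample, pop_size=500,
                                                  rng=rng)
```

Because the band must hold simultaneously over all time points, `c_alpha` is typically larger than the pointwise normal quantile (about 1.96 for a 95% level), which is what distinguishes a confidence band from a collection of pointwise confidence intervals.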

In Section 2, we introduce notations, estimators of the mean curve where there is auxiliary information, and estimators of their covariance function. The algorithms for constructing confidence bands, based on the bootstrap or simulation of Gaussian processes, are described in Section 3. Section 4 then compares the different strategies (in terms of precision of the estimators, width and coverage of the confidence bands, and computation time) for purposes of estimating the consumption curves of the French electricity company EDF (Électricité de France). For this we use samples of size n = 1,500 in our test population consisting of N = 15,069 curves. To finish, we present several perspectives of research in Section 5.

