
Prototypical development of a privacy-preserving Web analysis service

Companies’ data collection and analysis practices as described in Chapter 3 have significantly increased users’ privacy concerns, a major impediment to successful e-commerce. Privacy legislation has been enacted in many countries to alleviate some of these concerns. Moreover, site owners are increasingly adopting an industry standard for privacy protection – the Platform for Privacy Preferences (P3P) – that gives users more control over their personal information when visiting Web sites. This chapter discusses the implications of these privacy requirements for the analysis framework of Chapter 3 and presents a prototypical Web analysis service that calculates the analyses of Table 7-3 and indicates the respective privacy requirements.

The chapter structure follows the main phases of the software development process [Sommerville, 2004]. Section 4.1 presents the main idea of the prototype’s business model. Section 4.2 concentrates on privacy requirements and their implications for the calculation of metrics and analytics in our analysis framework. Section 4.3 presents the prototype design which, given a set of privacy constraints and available data elements, selects the Web analyses that are allowed to be calculated. The main functions and processes are presented, and the constraints arising from the specified privacy requirements are formulated as a syntactical extension to P3P. Section 4.4 presents the user interface with its main selection parameters and output formats. Section 4.5 discusses the implementation of the prototype.

Section 4.6 briefly discusses how disallowed analyses could be modified in such a way that they return altered but still useful results without compromising privacy requirements. The goal is to provide a maximum of privacy to customers while still allowing the site analyst to obtain valuable results.

4.1  Business model

The main function of our privacy-preserving analysis prototype is to calculate those Web analyses in our framework (cf. Table 7-3) that are not restricted, given a set of privacy constraints and data elements. Moreover, if a site is P3P-enabled, the analysis service automatically parses the specifications of available data, purpose and jurisdiction and indicates potential restrictions when metrics and analytics are calculated.

The Web site owner can be located in any country; however, legal privacy restrictions are currently only specified for German retailers. If the site is not P3P-enabled, manual specifications are required. The Web service business model is depicted in Figure 4-1:



Figure 4-1: The Web Service business model

Data can be exchanged by electronic transmission (e.g. by download) or on a physical storage medium. If the data collector trusts the service provider and legislation does not restrict the use of certain data for analysis purposes, the complete data set can be transmitted without modifications. However, if the analysis service is mistrusted (an unlikely case, since the two parties would then hardly engage in a business relationship) or legal policies restrict the use of certain data for analysis purposes, the retailer must protect confidential information before the data is transmitted. Methods to protect sensitive data are discussed in Section 4.6.

Note that the tool does not protect the consumer from deliberate privacy violations by the retailer or the service provider. It only supports the data analyst in calculating allowed analyses and in recognizing potential privacy conflicts and the usage purposes that must be respected. Thus, the business model requires that all parties be trusted.

Standards for secure communication, e.g. the Secure Sockets Layer (SSL) [Stallings, 1999], are integrated in the framework. Further security questions, such as attacks by a malevolent hacker or employee, are beyond the scope of this work.

The Web service could be offered for a per-service fee or as a renewable or permanent license. The business model could be enhanced by comparing analysis results between companies to create and sell benchmark reports for specific industries. In this case, further privacy measures have to be taken to protect shared data from misuse by third parties [cf. Boyens, 2004].

4.2  Privacy requirements

The following section will discuss privacy requirements and their implications for our analysis framework. Section 4.2.1 discusses privacy restrictions in German legislation. Section 4.2.2 presents the main specifications of P3P. Privacy problems from data inferences are discussed in Section 4.2.3. We will give examples of how inferences could bypass P3P specifications.

Implications from these requirements for the specification of our privacy-compliant analysis service are discussed and summarized in a problem statement in Section 4.2.4.



4.2.1  Legal restrictions

Laws protecting the privacy of individuals exist in more than 30 countries [Kobsa, 2002]. A number of regional, industry-specific and transnational regulations have been adopted in addition. It is beyond the scope of this thesis to discuss and compare privacy laws in different countries and their national and transnational implications in detail. Comprehensive resources are available for this purpose [e.g. www.epic.org, www.privacy.org, www.privacyinternational.org, www.privacyexchange.org; Agre and Rotenberg, 1997; Andrews, 2002; Rotenberg, 2001]. The legislative requirements in this section focus on German privacy laws.

The EU Data Protection Directive [EU, 1995] and its extension to the electronic communications sector, Directive 2002/58/EC of the European Parliament and the Council concerning the processing of personal data and the protection of privacy in the electronic communications sector [EU, 2002], have been adopted in national laws in most European Union (EU) member states. In Germany, the EU Privacy Directive has been implemented in the Federal Data Protection Act [BDSG, 2003] and in the privacy laws of the German states [cf. EU, 1995]. For electronic services such as e-shops, the Teleservices Data Protection Act [TDDSG, 2001] contains further, more specific regulations. TDDSG and BDSG regulate the collection, processing and usage of person-related data (§1 (2) BDSG, §1 (1) TDDSG). Person-related data is defined as information about identified or identifiable17 persons (§3 (1) BDSG). The TDDSG differentiates between “stock” data that is necessary for the establishment, content and modification of a contractual relationship (§5 TDDSG) and “usage” data that is required for the use of services (§6 (1) TDDSG). §3a BDSG imposes an obligation of data avoidance and data economy, i.e. to collect and process as little person-related data as possible. A more detailed discussion of legal implications for e-commerce in Germany can be found in Hansen [2002].

The following sections discuss the main implications of German privacy laws for the analysis of user and usage data in our analysis framework. The main consequence of German privacy legislation for our framework is that certain analyses are only allowed on pseudonymous data; analyses requiring identified data must therefore be blocked by the analysis prototype. Table 7-3 indicates for each metric and analytic whether it requires identified or pseudonymous data. Privacy implications are discussed for three data types: Web user data (Section 4.2.1.1), Web usage data (Section 4.2.1.2) and microgeographic data (Section 4.2.1.3).



4.2.1.1  Web user data

Data collected for billing purposes in electronic retailing is person-related (§5 TDDSG) and must be modified by the e-shop before it can be transferred to the analysis service.

In our cooperation partner’s data schema (cf. Table 3-2), the (combinations of the) attributes name, surname, street, street_number, recipient_address, e-mail_address, date_of_birth and phone_number refer to identified or identifiable persons. Thus, all analyses that require attributes referring to identified or identifiable persons are disallowed according to German privacy legislation18.

As indicated in Table 7-3, some metrics and analytics in the analysis framework require at least a pseudonymous recognition of customers. Legislation explicitly allows the creation of pseudonymous user profiles for analysis purposes (§6 (3) TDDSG). Thus, if an analysis requires pseudonymous user data, all identifying attributes such as name and surname should be replaced by a pseudonymous customer_id. It should be noted that the linkage of pseudonymous user profiles with other attributes may lead to reidentification of customers; this problem is discussed in more detail in Section 4.2.3. The pseudonymization should be performed by a trusted party in the company (e.g. the data protection officer).
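For illustration, the pseudonymization step could be implemented as a keyed hash over the identifying attributes. The following is a minimal sketch; the class and method names are ours and not part of the prototype:

using System;
using System.Security.Cryptography;
using System.Text;

// Sketch: derive a stable pseudonymous customer_id from identifying
// attributes with a secret key held only by the trusted party (e.g.
// the data protection officer). Without the key, the pseudonym cannot
// be linked back to name and surname.
static class Pseudonymizer
{
    public static string PseudonymousId(byte[] secretKey, string name, string surname)
    {
        using (var hmac = new HMACSHA256(secretKey))
        {
            byte[] digest = hmac.ComputeHash(Encoding.UTF8.GetBytes(name + "|" + surname));
            // The hex-encoded digest replaces name/surname in the analysis database.
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}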

German legislation requires the deletion of identifiable transaction data not later than six months after the time of data collection (§6 (4), §6 (7) TDDSG). In order to perform pseudonymous analyses over a longer period of time, the company should establish two separate databases: a “business intelligence” database where only pseudonymous information is stored for data analysis and a “transaction” database where billing data is stored for order fulfillment. The analysis service should have access only to the business intelligence database.

A technical approach to incorporating privacy policy enforcement into an existing application and database environment has been proposed in Le Fevre et al. [2004]. Agrawal et al. [2004] proposed an auditing framework that determines whether a database system is adhering to its data disclosure policies.



4.2.1.2  Web usage data

Web logs are considered person-related because user sessions contain the attribute ip_address or other attributes possibly indicating a user’s identity (e.g. login_name, user_authentication). In particular, users with a static IP address could possibly be identified.19 According to §6 (3) TDDSG, a pseudonymous analysis of Web usage data would be possible. Thus, the data collector should perform the following pseudonymization steps before the data is transferred to the analysis service:

If an ip_address is required for session reconstruction, it must be replaced by a pseudonymous ID. Moreover, the e-shop must delete or pseudonymize all attributes possibly indicating a user’s identity such as user_login or authentication before the log file is stored for analysis.

The ip_address is also required for matching localization and geographic information (cf. Section 3.1.1). Such an analysis would be illegal under German privacy legislation. The data collector could delete the last digits of the ip_address; however, this significantly decreases the accuracy of IP localization tools.
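A minimal sketch of this coarsening step (the function name is ours; a production version would also have to handle IPv6 addresses):

using System;

// Sketch: coarsen an IPv4 address by zeroing its last octet before
// the log file is handed to the analysis service. The host part is
// dropped, so localization accuracy degrades accordingly.
static string CoarsenIp(string ipAddress)
{
    string[] octets = ipAddress.Split('.');
    if (octets.Length != 4)
        throw new ArgumentException("expected an IPv4 dotted quad");
    octets[3] = "0";
    return string.Join(".", octets);  // e.g. "141.20.23.5" -> "141.20.23.0"
}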

Note that the session and order tables could be joined via the attribute access_time when the customer_id is assigned in consecutive time sequence. However, if all identifying attributes in the order table have been replaced by pseudonymous information, the session data may remain anonymous and the analysis would thus comply with privacy legislation.

In order to recognize visitors across several sessions, cookies or login information are required [Berendt et al., 2001]. The use of cookies, their settings and usage purposes should be explicitly communicated to the site users. The information stored in the cookie should not contain links to identified or identifiable information.

4.2.1.3  Microgeographic data

In contrast to the offline domain, where legislation has adopted a marketing-friendly privacy regime (cf. §28 BDSG), the TDDSG is more restrictive regarding the analysis and use of microgeographic data in the online domain. The combination of online billing information and microgeographic data is illegal if a customer becomes identifiable [Weichert, 2004]. Thus, analyses that combine online data with microgeographic data are only legal if the user remains anonymous.

4.2.2  P3P specifications

Besides legal restrictions that are mandatory, companies can self-impose further restrictions on their data collection and data usage practices. A company’s privacy practices are typically posted as online privacy statements (also known as “privacy policies” or “privacy disclosures”).

A technical approach to codifying a company’s Web privacy practices is the Platform for Privacy Preferences (P3P). It enables Web sites to express their privacy practices in a standard XML (Extensible Markup Language) format that can be retrieved automatically and interpreted easily by user agents. P3P is an industry-supported, self-regulatory approach to privacy protection. It has been recommended by the W3C [Cranor et al., 2002] as a protocol to communicate how a site intends to collect, use, and share personal information about its visitors. P3P adoption is 33% for the top 100 Web sites and 22% for the top 500 Web sites [Ernst & Young, May 2004].

P3P-enabled browsers parse a site’s privacy policy automatically and compare it to the privacy preferences of the visitor, who can then decide whether or not to use the service. Once a P3P policy is set up on a Web site, it becomes a legally binding agreement, predicated on notice and consent, between the Web site and the user [Cranor et al., 2002]. In the US, the Federal Trade Commission and several states have increasingly sued companies that did not adhere to their privacy policies for unfair and deceptive business practices.

P3P cannot constrain or modify existing privacy legislation. Thus, the use of P3P by itself does not constitute compliance with the EU Data Protection Directive, though it can be an important part of an overall compliance strategy [Cranor et al., 2002]. The latest version of the P3P specification (Version 1.1 as of January 2005) includes a “jurisdiction” extension element into which a known URL of a body of legislation can be inserted and which can be recognized by user agents.

The P3P 1.0 specification defines a base set of data elements a Web site may wish to collect as well as standard sets of uses, recipients and other privacy disclosures. A STATEMENT describes data practices that are applied to particular types of data. A STATEMENT element is a container that groups together a PURPOSE element, a RECIPIENT element, a RETENTION element, a DATA element, and optionally other information.

P3P is characterized by an “atomic” focus through its separate description of different combinations of DATA, PURPOSE and RECIPIENT. This may lead to problems when data are combined. The implications of this problem for our analysis service are discussed in Section 4.2.4.

4.2.2.1 The DATA element of P3P

P3P provides a data schema built from a number of predefined data elements, which are specific data entities a service might typically collect (e.g. last name or telephone number).

The data schema in our privacy-preserving analysis tool parses the data elements specified in a P3P policy. Further data elements can be specified manually. Analyses are disabled if required data are not available.

4.2.2.2 The PURPOSE element of P3P

The PURPOSE element requires site owners with P3P policies to explain and disclose the purpose of data collection for each DATA element or group that is collected. P3P suggests twelve standard purposes of data collection [P3P, 2002]. The PURPOSE specification for DATA elements does not restrict the calculation of analyses in our service tool. As discussed before, the use of the analysis results depends on the company’s business interests and cannot be controlled by our analysis service. However, the service automatically indicates for what purpose(s) DATA elements were collected, which reminds a Web site owner to use the analysis results only for the specified purpose(s).

4.2.2.3 The RECIPIENT element of P3P

P3P STATEMENTs must include a RECIPIENT element containing one or more recipients of the data. In order to assure a legal use of the analysis framework, the Web site owner should specify that the collected data is received by the data collector (<OURS>) and by the analysis service provider as a party that uses the data under the same practices (<SAME>).

4.2.2.4 The RETENTION element of P3P

A STATEMENT element must also include information about the data collector’s retention policy. In order to use the analysis service, the data collector should specify that data is retained for analysis purposes.



4.2.3  Inference problems

A problem that has not yet been directly addressed in P3P specifications is inference from data that are re-combined after collection. Inferences20 exploit the possibility of intersecting separate releases of identified and unidentified data. Even if identity keys are not known, attributes from secondary data sources may unambiguously point to a single person [Denning, 1982; Sweeney, 2001].

Related problems have been described by the methods of data and record linkage [e.g. Fellegi, 1972; Newcombe et al., 1992; Winkler, 1995], pattern matching with aggregation operations [e.g. Torra, 2000] and object identification [Neiling, 2004].

Sweeney [2002] used publicly available information from a voter registry containing the attributes name, age, gender and address. These attributes were compared with “anonymized” patient records from hospitals (from which patient names had been deleted). Sweeney found that the attributes {date_of_birth, 5_digit_ZIP_code} identified 69% of the patients, {date_of_birth, gender} identified 29% and {date_of_birth} alone identified 12%.

Inference problems also arise in geomarketing, where customer attributes such as customer_id, gender, date_of_birth, credit_rating, zip_code, street_name, street_number, pages_visited and product_name are exchanged and matched with secondary demographic data. The matching is legal in the online domain as long as the user profile remains pseudonymous, and deleting name and address would be a first step towards pseudonymization. An analysis becomes privacy-critical, however, if a customer_id and zip_code are linked to the product_name ordered, because conclusions can then be drawn about a customer’s preferences and residence. For example, a researcher who orders specialized books in his field of interest is likely to be identified by the zip code that indicates the location of his university or research institution. Especially for sparsely populated zip code areas – the smallest zip code data cell in our sample included 12 residents – the data miner could possibly find out who the customers are. Having the exact geographical location of a customer would be desirable to determine user profiles more accurately, but would aggravate the privacy problem because precise coordinates could reveal a customer’s identity.

Inferences in our analysis framework depend on the secondary data that is available for data linkage. Voter registration lists containing attributes such as name, zip_code and date_of_birth, which were used by Sweeney [2001] to reidentify hospital patients, are not publicly available in Germany because federal law prohibits third-party access to the voter registry (§17 I of the German Federal Electoral Law [BWahlG, 2005]). Thus, due to this limited access to external information, individuals in Germany are likely to be less affected by inference problems than those described in Sweeney [2001].

In addition to the inference problems identified here, there is an inherent risk that yet unknown inference problems may affect a company’s analysis framework in the future.

4.2.4  Problem statement

In summary: certain data may be analyzed for certain purposes, and the data may be used for these purposes by certain recipients. The basic relational framework of P3P, however, is insufficient to account for inferences through which certain data may substitute for other data. Independently of whether its use is permitted, data is either available or not, and each analysis requires certain data. Finally, legal regulations may restrict data usage. These relations constitute the problem specification for the analysis prototype (cf. Figure 4-2).

Figure 4-2: Problem specification

4.3  Design

This section presents the prototype design. Section 4.3.1 presents the main data types and relations used in the prototype. Section 4.3.2 describes the main functions and work processes. An extension of P3P for the privacy requirements presented in Section 4.2 is proposed.

4.3.1  Data types and relations

We distinguish between the data the computation of analyses works on (input data) and the data the process works with (process data). The input data is formed by the Web log and Web user data (such as purchase, socio-economic, geographic and other data), together with the privacy policy the enterprise has adopted; physically, this policy consists of the P3P file. The process data is the business logic that defines the whole analysis process.

4.3.1.1 Input data

The input data describe the data items, purposes and recipients in Figure 4-2 as well as the relations between them. The input data consist of three sets: the set of basic data elements D, the set of purposes P and the set of recipients R. These are the same entities as those defined in P3P. Note that all these sets are enumerable:

D = {user, third party, business, dynamic}, with every element again a set of data as defined in Cranor et al. [2002]. Note that D can be extended by the issuer of the policy. Furthermore, for later use we define D_set, which is formed by the sets of elements of D.

P is the set of the 12 relevant purposes as defined for the PURPOSE element, and R is the set of the six possible values for the RECIPIENT element. These two sets are not extensible.

The P3P STATEMENT establishes a relation between elements belonging to these three groups by assembling the DATA, PURPOSE and RECIPIENT elements.

4.3.1.2 Process data

The analysis framework introduces two new data entities for the process data. The first is the set of analyses I, which is formed by all metrics and analytics that can be calculated from the present data. This set is fixed and not user-extensible. The second entity is the availability A = {true, false}. A indicates whether an instance of data is physically stored in the enterprise and can be made available to the analysis process. Note that availability is defined purely technically; no privacy aspects are considered at this point.

4.3.1.3 Functional data relations

The functional data describe the relations between the availability, the analyses and the data items in Figure 4-2. Before analyzing the functional relations between the different data, we introduce our notation of functional relationship [Pepper, 2003]. A function f is a triple (D_f, W_f, R_f), formed by a domain D_f, a range W_f and a relation R_f ⊆ D_f × W_f, the function graph. This graph has to be right-unique, i.e. there are no two pairs (a, b_1) ∈ R_f and (a, b_2) ∈ R_f with b_1 ≠ b_2. The function f maps the argument value x to the result value y if the pair (x, y) is part of the function graph: (x, y) ∈ R_f. A given function f = (D_f, W_f, R_f) is called partial if π_1(R_f) ⊂ D_f, where π_1 is the projection defined as π_1(A × B) = A21. Otherwise, i.e. if π_1(R_f) = D_f, f is called total.

Every statement in a given policy implicitly defines a function h: D × R × P → {allowed}. The codomain of this implicit function is the one-element set {allowed}. This function is (usually) partial, as not every purpose is allowed for every recipient on all the data. In the following, we totalize h by defining k as k(x) = h(x) if x ∈ π_1(R_h) and k(x) = not allowed otherwise. k is total.

Example: consider a statement in a P3P file such as the following excerpt from Example 4.1 in Cranor et al. [2002]:

...
<STATEMENT>
  <PURPOSE><individual-decision required="optout"/></PURPOSE>
  <RECIPIENT><ours/></RECIPIENT>
  <RETENTION><stated-purpose/></RETENTION>
  <DATA-GROUP>
    <DATA ref="#user.name.given"/>
    <DATA ref="#dynamic.cookies">...</DATA>
  </DATA-GROUP>
</STATEMENT>
...

This fragment defines the following elements of R_h:

((user.name.given, ours, individual-decision), allowed)

((dynamic.cookies, ours, individual-decision), allowed)
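For illustration, the totalized function k can be represented as a lookup table whose entries are exactly the elements of R_h. The following C# sketch uses our own types and names:

using System.Collections.Generic;

// Sketch: k as a lookup over (data, recipient, purpose) triples.
// Triples contained in the set are exactly the elements of R_h;
// every other argument is mapped to "not allowed", so k is total.
static class Policy
{
    static readonly HashSet<(string Data, string Recipient, string Purpose)> Rh =
        new HashSet<(string, string, string)>
        {
            ("user.name.given", "ours", "individual-decision"),
            ("dynamic.cookies", "ours", "individual-decision"),
        };

    // Returns true for "allowed" and false for "not allowed".
    public static bool K(string data, string recipient, string purpose)
        => Rh.Contains((data, recipient, purpose));
}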

There are two more functions that establish relations:

The function requiredfor: D × I → {true, false}, defined on the data D and the analyses I, states whether a data item is used within the calculation of an analysis. The function isavailable: D → A indicates whether a given data item is available. By definition, isavailable(<>) = true, where <> indicates “no data”.

As all the sets are enumerable, these functions k, requiredfor and isavailable can be defined “point for point” for all elements. They are deterministic. Extensions of D require an extension of all three function graphs.

4.3.2  Functions and work processes

Given the set of all possible analyses, the subset of “executable analyses” contains all analyses that can actually be executed. We define each of our metrics and analytics in the analysis framework as a business analysis i ∈ I. Thus, I ⊇ I_executable = τ(I), where the function τ selects all the executable analyses from I (τ acts as a filter). This section provides the definition of τ: I → I.

Whether a given analysis i ∈ I is executable or not depends on two requirements: its execution has to be feasible and its execution must be allowed. With respect to an implementation of the framework, it is reasonable to check in this order because the check for the technical requirements is usually simpler.

The technical requirements are (i) the presence of the definition of this analysis (the implementation has to know how to calculate it) – we presume that this is always guaranteed – and (ii) the presence of the data needed for its calculation, i.e. isavailable(d_j) = true ∀ d_j: requiredfor(d_j, i) = true.

The restrictions imposed by the privacy policy are expressed by k. The execution of an analysis i is allowed if k(d_j, r, p) = allowed ∀ d_j: requiredfor(d_j, i) = true, where r ∈ R and p ∈ P have to be specified by the analyst.

There is no fixed relation between purpose and analysis, as the calculation of a given analysis can serve multiple purposes. As discussed in Section 4.1, the lack of such relations is a serious problem for the privacy-compliant analysis of consumer data if one party is mistrusted. For each analysis result, the prototype therefore displays the P3P purpose(s) specified for each data attribute involved. For analyses combining data items with different usage specifications, an alert message is displayed in addition.

We define τ, the filter for the executable analyses, as a composition of the functions already known (<> is “no analysis”):
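In the notation introduced above, this composition amounts to:

τ(i) = i, if isavailable(d_j) = true ∧ k(d_j, r, p) = allowed ∀ d_j: requiredfor(d_j, i) = true;
τ(i) = <>, otherwise.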



In this consideration, we have assumed that a company stores its data with the attribute names and the level of aggregation defined by the P3P base data schema. In real systems, this assumption is usually not fulfilled, and additional matching and aggregation or disaggregation of the data has to be performed. But as this is only a question of naming and storage, it has no impact on the theoretical process of decision making.
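A minimal code sketch of this filter (the signatures are ours; the prototype’s actual interfaces may differ):

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: an analysis is executable iff every required data item is
// available (feasibility) and its use is allowed for the recipient
// and purpose at hand (k). Non-executable analyses are filtered out,
// which corresponds to mapping them to <>.
static IEnumerable<string> ExecutableAnalyses(
    IEnumerable<string> analyses,
    IEnumerable<string> dataItems,
    Func<string, string, bool> requiredFor,   // (data, analysis)
    Func<string, bool> isAvailable,           // data
    Func<string, string, string, bool> k,     // (data, recipient, purpose)
    string recipient, string purpose)
{
    foreach (string i in analyses)
    {
        var required = dataItems.Where(d => requiredFor(d, i)).ToList();
        bool feasible = required.All(isAvailable);
        bool allowed  = required.All(d => k(d, recipient, purpose));
        if (feasible && allowed)
            yield return i;
    }
}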

4.3.2.1  Impact of data inference on decision making

We define an inference as a function σ: D_set → D. If there is an inference, we can write σ({d_1, d_2, …, d_n}) = d_{n+1} with d_i ≠ d_j ⇔ i ≠ j. The existence of inferences is a problem for the decision on whether an analysis may be calculated or not. In particular, there is a problem for R_h.

Consider two data items for which the same restrictions on purpose and recipient apply: (d_1, r_1, p_1) and (d_2, r_1, p_1). Moreover, there is an inference such that σ({d_1, d_2}) = d_3. For d_3, the following purpose limitation applies: (d_3, r_1, p_3). Consider an analysis that the recipient r_1 wants to use for the purpose p_1 and which requires the data d_3. Calculating this analysis from d_3 directly is prohibited by the P3P policy if the desired purpose is different from the allowed purpose (p_1 ≠ p_3). However, calculating the analysis from d_1 and d_2 is possible. Thus, inferences may bypass privacy restrictions.

The site user who accepted the policy is not protected against this violation of her privacy preferences – unless she employs a user agent that (i) is aware of this inference possibility and (ii) extends the usage restriction to also cover inferred data.
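A user agent could detect such bypasses with a check along the following lines; the types and names are ours, and we assume an acyclic set of inference rules:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: a datum counts as effectively obtainable for (recipient,
// purpose) if it is directly allowed by k, or if some inference rule
// σ({premises}) = conclusion yields it from premises that are each
// effectively obtainable themselves. A user agent can use this to
// detect that a usage restriction can be bypassed via inference.
sealed class Inference
{
    public IReadOnlyList<string> Premises;
    public string Conclusion;
}

static bool EffectivelyObtainable(
    string data, string recipient, string purpose,
    Func<string, string, string, bool> k,
    IReadOnlyList<Inference> rules)
{
    if (k(data, recipient, purpose))
        return true;   // directly allowed
    // Assumes the rule set is acyclic; otherwise a visited-set is needed.
    return rules.Any(rule =>
        rule.Conclusion == data &&
        rule.Premises.All(p => EffectivelyObtainable(p, recipient, purpose, k, rules)));
}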

To achieve this goal, we propose an extension to P3P. Additional elements can be included into a policy by the element EXTENSION as defined in Cranor et al. [2002].

We suggest an unordered list of inference statements. Each INFERENCE statement consists of the data that can be inferred if a given set of data is present. A human-readable explanation can be added within the CONSEQUENCE element.

From a given premise, it may be possible to conclude n consequences. This is expressed as n separate INFERENCE statements, each with an atomic consequence. In addition, one may want to express an inference possibility such as “if d_1 and (d_2 or d_3) are given, then it is possible to infer d_4”. This may be split into two statements: “if d_1 and d_2, then d_4” and “if d_1 and d_3, then d_4”. However, the introduction of the connector OR in addition to AND makes the formulation and reading of inferences easier for human users.

DATA-GROUPs can be placed within one of these elements to express logical relations between them. The following fragment shows an example.
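In the fragment below, the element names PREMISE and CONCLUSION, the namespace URI and the concrete data references are illustrative choices:

<EXTENSION optional="no">
  <INFERENCES xmlns="http://www.example.org/p3p/inferences">
    <INFERENCE>
      <CONSEQUENCE>
        Date of birth and ZIP code together may identify a customer.
      </CONSEQUENCE>
      <PREMISE>
        <AND>
          <DATA-GROUP>
            <DATA ref="#user.bdate"/>
            <DATA ref="#user.home-info.postal.postalcode"/>
          </DATA-GROUP>
        </AND>
      </PREMISE>
      <CONCLUSION>
        <DATA-GROUP>
          <DATA ref="#user.name"/>
        </DATA-GROUP>
      </CONCLUSION>
    </INFERENCE>
  </INFERENCES>
</EXTENSION>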



User agents should parse these inferences. As this extension adds further restrictions to the policy, it is mandatory.

According to the W3C specification of P3P, we define the INFERENCES extension using the Augmented Backus-Naur Form (ABNF) notation of Crocker and Overell [1997]. For simplicity, we abstain from an XML schema definition, even though this entails a loss of flexibility.
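An illustrative grammar along these lines could read as follows; the rule names mirror the example above and are not normative:

inferences  = "<INFERENCES>" 1*inference "</INFERENCES>"
inference   = "<INFERENCE>" [consequence] premise conclusion "</INFERENCE>"
consequence = "<CONSEQUENCE>" text "</CONSEQUENCE>"
premise     = "<PREMISE>" connective "</PREMISE>"
connective  = and / or / datagroup
and         = "<AND>" 2*connective "</AND>"
or          = "<OR>" 2*connective "</OR>"
conclusion  = "<CONCLUSION>" datagroup "</CONCLUSION>"

Here, datagroup and text are defined as in the P3P specification.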



4.3.2.2 Coding legal restrictions in a P3P policy

As pointed out in Section 4.2.1, laws impose restrictions on the use of data. These restrictions are usually independent of recipient and purpose [EU, 2002]. Whereas the STATEMENTs in a policy file allow data to be used within the specified bounds, legal specifications always restrict uses. A priori, any legal restriction can be coded in a P3P policy by listing all allowed uses, so that all uses not listed are prohibited. But this realization does not respect the simultaneity restriction: consider two data items d_1 and d_2 that may each be used by a given recipient r_1 for a given purpose p_1. These separate uses are allowed by the laws and so may be listed in a P3P policy. But combining the same data (i.e. simultaneous use) for the same purpose is not allowed. This restriction cannot be coded in a P3P policy. Thus, to remedy this shortcoming of P3P, we suggest the introduction of a new element LEGAL that restricts combined usage.

Within the LEGAL element, several RESTRICTION elements can be specified. Each RESTRICTION can have four attributes; the introduction of additional attributes or values needs to be discussed. The ISSUER attribute specifies the name of the legal authority that codified the restriction; the LAW attribute contains the name (possibly shortened) of the legal norm which is the origin of this restriction. The values of both attributes are human-readable strings. The FOR attribute indicates the region the site user must belong to for this restriction to be applied. Possible values are comma-separated combinations of “all”, “EU” and the ISO country abbreviations such as “US” for the United States of America, “GB” for the United Kingdom, or “DE” for Germany22. The default value is “all”. Finally, the non-value attribute “viceversa” expresses the repetition of the same restriction with the WHILE and DONT elements reversed.

Within the RESTRICTION element, a CONSEQUENCE element can be defined, as specified in P3P and also used for the INFERENCE extension.

The main elements are WHILE and DONT. Both contain a single DATA-GROUP with one or more DATA elements. Using all the DATA in the DONT element concurrently with the DATA in the WHILE element is not allowed. As this extension adds further restrictions to the policy that cannot be ignored, it is a mandatory extension.
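In code, the corresponding check could look as follows (our types; the exact treatment of partial overlaps would have to be fixed in the specification):

using System.Collections.Generic;
using System.Linq;

// Sketch: a set of data items used by an analysis violates a LEGAL
// restriction if it covers all WHILE data and touches the DONT data
// (and, for "viceversa" restrictions, the other way round).
sealed class Restriction
{
    public ISet<string> While;
    public ISet<string> Dont;
    public bool ViceVersa;
}

static bool Violates(ISet<string> usedData, Restriction r)
{
    bool direct = r.While.All(usedData.Contains) && r.Dont.Any(usedData.Contains);
    bool reverse = r.ViceVersa
                   && r.Dont.All(usedData.Contains) && r.While.Any(usedData.Contains);
    return direct || reverse;
}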

The following fragment shows an example of the P3P extension using the LEGAL element.
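The attribute values, the namespace URI and the data references in this reconstruction are illustrative:

<EXTENSION optional="no">
  <LEGAL xmlns="http://www.example.org/p3p/legal">
    <RESTRICTION issuer="Deutscher Bundestag" law="§6 (3) TDDSG" for="DE">
      <CONSEQUENCE>
        Pseudonymous usage profiles must not be combined with data
        identifying the bearer of the pseudonym.
      </CONSEQUENCE>
      <WHILE>
        <DATA-GROUP>
          <DATA ref="#user.name"/>
        </DATA-GROUP>
      </WHILE>
      <DONT>
        <DATA-GROUP>
          <DATA ref="#dynamic.clickstream"/>
        </DATA-GROUP>
      </DONT>
    </RESTRICTION>
  </LEGAL>
</EXTENSION>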



As the same legal restrictions apply to a large variety of Web sites, mechanisms for including a set of referenced legal restrictions hosted by a trusted provider (e.g. governmental authorities) should be developed as well.

4.3.2.3 Workflow

Figure 4-3 summarizes the processes within the framework, including the successive data exchanges and actions between the participants involved. The analysis provider’s task “identifies inferences, executes / disables analyses” is both an action and a restriction. For each exchange, its format is noted in an exemplary form; inter-unit exchanges rely on standardized protocols and data description formats. Note that the framework includes the extensions for legal restrictions and inference problems.



Figure 4-3: Workflow

4.4  User interface

We have implemented a prototype based on the analysis framework proposed in Chapter 3. This section provides a partial technical description of the prototype.

The analysis service has three specification phases. Currently, the specification has to be done manually; future releases will support automated data retrieval and policy parsing. In each of the three phases, the analyst is informed of her specific task, and input errors are reported immediately.

The first phase is the specification of the data the enterprise has stored: data availability is defined here. The second phase is the specification of the P3P privacy policy that applies to the data specified in the first step. The third phase is the selection of the analysis time frame and the desired analyses. The list of 82 metrics and analytics, grouped into eight categories, is presented. The user interface only enables the metrics and analytics that are allowed given the available data and legal privacy restrictions; other analyses are disabled and grayed out. The time frame (time interval of analysis) can be typed directly or chosen from a calendar control. Once an analysis has been chosen, one of three output formats is proposed depending on the type of analysis: HTML, XML or an image. Images are generated dynamically using standard classes of the .NET Framework, and the analyst can handle such an image like any other image – she can save it, copy it, etc. Image formats (PNG, GIF, JPEG, BMP, TIFF, etc.), colors and fonts can be freely configured. The direct streaming avoids problems with asynchronous page requests, image generation and image requests; moreover, there are no problems with temporary files. During our analyses based on the data of the multi-channel retailer, no time lags were detected: the image generation “on the fly” does not slow down the output flush.
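A minimal sketch of such dynamic image streaming in classic ASP.NET (the handler and the drawing code are illustrative, not the prototype’s actual classes):

using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Web;

// Sketch: render an analysis result to a bitmap and stream it to the
// client without writing temporary files. The image is first saved to
// a MemoryStream because the PNG encoder requires a seekable stream.
public class ChartHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        using (var bitmap = new Bitmap(400, 300))
        using (var g = Graphics.FromImage(bitmap))
        using (var buffer = new MemoryStream())
        {
            g.Clear(Color.White);
            g.DrawString("Page impressions per day", SystemFonts.DefaultFont,
                         Brushes.Black, 10, 10);
            // ... draw the actual chart of the selected analysis here ...
            bitmap.Save(buffer, ImageFormat.Png);
            context.Response.ContentType = "image/png";
            context.Response.BinaryWrite(buffer.ToArray());
        }
    }
}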

Figure 4-4 shows a screenshot of the analysis tool’s user interface (phase 3 of the specification process). The background shows part of the analysis choice list with some choices disabled:

Figure 4-4: Main interface design with analyses choice list, privacy indication and time frame selection

4.5  Implementation

The prototype is a Web-based application written in C# on Microsoft .NET. The Web server dynamically generates Web pages to interact with the analyst, who is not required to install additional client software. All browser types are supported as long as they support client-side ECMAScript (JScript or JavaScript).

Two databases are involved in the analysis process: the first is a Microsoft (MS) Access database providing the complete preprocessed Web data to be analyzed. The second is an MS SQL Server database that holds the process data.

According to the P3P Guiding Principles [Cranor et al., 2002], measures have been taken to protect any information that is transferred between the analyst and the tool. HTTP over SSL with strong encryption is used as a trusted protocol for the secure transmission of data. Restrictive session timeouts prevent the abuse of foreign sessions. Analysts have to log on with a personal password, and temporary session cookies are used to prevent other analysts from “stealing” a session.

The system has been tested on data from the described multi-channel retailer. Applying the service framework to an online retailer’s consumer data revealed privacy problems in a real-world context: potential inference problems and legislative privacy implications were identified and could be addressed within the framework.

4.6  Modification of analyses

For the hypothetical case that the analysis service is untrustworthy and the data collector wishes to protect the data before transfer, we briefly discuss possible protection measures. One solution for protecting sensitive data in a two-party business case – as described in Section 4.1 – is the use of encryption techniques. The basic idea is to leave the data on the data collector’s server and transfer only encrypted data to the service provider [cf. Domingo-Ferrer and Herrera-Joancomarti, 1999; Rivest et al., 1978]. Asonov and Freytag [2002] described a hardware-based approach to encryption using a secure coprocessor. Encryption functions are useful for a limited number of algorithmic operations such as addition, subtraction, multiplication and multiplicative inversion, and for basic database queries such as selection, projection and join [Boyens, 2004]. However, for more complex mining queries such as those in our analysis framework, encryption techniques are not suitable.

Statistical disclosure control is an approach to minimizing privacy problems in databases [Agrawal and Srikant, 2000; Willenborg and de Waal, 2001]. Its techniques can be broadly classified into query restriction and data perturbation [Agrawal and Srikant, 2000]. Using these techniques, data are modified in such a way that the probability of reidentifying individual users is kept below a selected threshold. Query restriction includes restricting the size of query results [Denning et al., 1979; Fellegi, 1972], controlling the overlap among successive queries [Dobkin et al., 1979], suppressing data cells of small size [Cox, 1980], and clustering entities into mutually exclusive atomic populations [Yu and Chin, 1977]. Perturbation techniques suggest ways of adding noise to the data while maintaining some statistical invariant; they include swapping values between records [Denning, 1982], replacing the original data by a sample from the same distribution [Lefons et al., 1983], adding noise to the results of a query [Beck, 1980], and sampling query results [Denning, 1982]. Both methods have advantages and disadvantages, and neither is an optimal solution: query restriction cannot completely avoid inferences but provides valid responses, whereas perturbation techniques can prevent inferences but may not provide precise query results.

For our analysis framework, the following disclosure techniques are particularly useful for minimizing privacy problems (a code sketch of these measures follows the list):

  1. Limit access to the data, i.e. hide attributes that potentially identify data subjects (e.g. customer_id, address, e-mail) and exclude zip code areas with a small number of inhabitants.
  2. Aggregate the data, i.e. summarize the data in such a way that no conclusions can be drawn for a single subject, e.g. use zip codes instead of more fine-grained location data or replace the exact date_of_birth with the year_of_birth.
  3. Assign unique identifiers randomly, i.e. deploy primary keys that do not contain additional information about the subject they are pointing to, e.g. do not assign customer_id in consecutive order because it could possibly be linked with a person-related IP number.
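A minimal sketch of these three measures on a simplified customer record (the field names and the cell-size threshold are illustrative):

using System;
using System.Collections.Generic;

// Sketch: (1) drop identifying attributes, (2) aggregate date_of_birth
// to year_of_birth and suppress sparsely populated zip code areas,
// (3) assign random identifiers that carry no ordering information.
sealed class Customer
{
    public string CustomerId, Name, Email, ZipCode;
    public DateTime DateOfBirth;
}

sealed class AnonymizedCustomer
{
    public Guid RandomId;       // (3) random, non-consecutive key
    public int YearOfBirth;     // (2) coarsened date of birth
    public string ZipCode;
}

static IEnumerable<AnonymizedCustomer> Disclose(
    IEnumerable<Customer> customers,
    IDictionary<string, int> zipPopulation,
    int minCellSize)
{
    foreach (var c in customers)
    {
        // (1)/(2): suppress records from sparsely populated zip areas.
        if (!zipPopulation.TryGetValue(c.ZipCode, out int n) || n < minCellSize)
            continue;
        yield return new AnonymizedCustomer
        {
            RandomId = Guid.NewGuid(),
            YearOfBirth = c.DateOfBirth.Year,
            ZipCode = c.ZipCode
            // name, e-mail and customer_id are not carried over (1)
        };
    }
}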

A problem with these disclosure techniques can be the limited quality of the query results. In the case of geomarketing, for example, the shop needs to trade off the precision of its results against the potential privacy violation of its users.

Moreover, inference opportunities remain a privacy risk: inference problems that are not known at the time of anonymization inherently threaten user privacy in statistical disclosure control [Boyens, 2004].

4.7 Conclusion

A framework for deploying Web analyses has been set up and tested on data from a multi-channel retailer. We have determined the different data types that are involved in the data analysis process and established the functional relations between them. An automated way of filtering business analyses according to privacy restrictions has been presented. Due to our proposed extensions of the P3P specification, it is now possible to code both data inferences and legal usage restrictions.

We also discussed approaches to modifying the underlying data so that analyses can be delegated to an untrusted service provider.


Footnotes and Endnotes

17 When users can be identified with reasonable effort based on the data collected, privacy laws already apply.

18 However, the visitor can give explicit consent to the use of her data for analysis purposes according to §3 (2), §4 (2) TDDSG. Moreover, individual analyses may be allowed if they are required for the fulfillment of the transaction purpose. Thus, it may be possible to compile a “black” list of customers who frequently did not pay.

19 However, telecommunication service providers can store a user’s IP address and combine it with user data if it is required for billing purposes (§6 (2) TDDSG).

20 Often referred to as “reidentification” or “triangulation” problems.

21 Analogously: π_2(A × B) = B

22 http://www.iso.org


