| ↓9 |
|
The human animal needs a freedom seldom mentioned, freedom from intrusion. He needs a little privacy.
|
| ↓10 |
A service that is provided via any kind of wired or mobile network is called net-based service [Tamm and Günther, 2004]. When these services are provided using technologies recommended by the World Wide Web Consortium2 (such as HTTP [W3C, 1997]) and using the underlying transport/network protocol TCP/IP [DARPA, 1981; DARPA, 1981] , we speak of web-based services. Net-based services that do not follow these standards include mobile services and services that are based on proprietary technologies (such as internal company networks). Note that because of their relative novelty, no unambiguous definition has been established yet. For reasons of simplicity, we will from now on use the terms net-based and web-based services synonymously and denominate them as the latter.
An important subtype of web-based services are web services who are defined as "software systems identified by uniform resource identifiers, whose public interfaces and bindings are defined and described using XML. Its definition can be discovered by other software systems. These systems may then interact with the web service in a manner prescribed by its definition, using XML based messages conveyed by Internet protocols" [W3C, 2004]. These services do not require a graphical user interface and are mainly used to improve the communcation and interoperability between applications.
| ↓11 |
Complementary to this technological definition, web-based services can also be classified following their business model. For example, an Application Service Provider (ASP) "deploys, hosts and manages access to a packaged application to multiple parties from a centrally managed facility. The applications are delivered over networks on a subscription basis. This delivery model speeds implementation, minimizes the expenses and risks incurred across the application life cycle, and overcomes the chronic shortage of qualified technical personnel available in-house" [IDC, 1999]. Usually, the ASP charges a flat fee per user from its client. We will refer to this business model further in Section 3.1.
For the purposes of this thesis, we speak of a (web-based) service S when a service provider SP stores, modifies, processes, publishes or forwards some confidential input data D given by the data holder (see Figure 2-1). We assume that the service provider creates a service result S(D) that is either returned to the data holder (2-party case) or forwarded to an authorized third party (3-party case). In the 2-party-case, the data holder is also the service user, see the solid service line in Figure 2-1. In the 3-party-case, this is not necessarily the case, as an external third party is involved, as depicted by the dashed line in Figure 2-1.
| Figure 2-1: Input data and provided service | ||
| ↓12 |
The input data D can be delivered reactively, e.g. by explicitly specifying weight and blood pressure for an online health check. Or it can be delivered non-reactively / passively, e.g. when a physician forwards patient data to a health service institution for research purposes.
Roles of the service provider include online stores (www.amazon.com), online service providers (such as financial services, e.g. www.citibank.com), data mining providers (e.g. www.datashaping.com) or regional health initiatives (e.g. www.prhi.org).
Exemplary roles of the third party include direct marketing companies, medical researchers, patients or yet another service provider.
| ↓13 |
Based on the data holder's trust level of trust towards the service provider and towards the third party, we can distinguish four different privacy constellations. Table 2-1 illustrates this. For a detailed discussion of what influences trust in web-based services, see [Jarvenpaa, et al., 2000]3.
Table 2-1: The four privacy constellations depending on the data holder's trust
|
Trust in 3rd party |
|||
|
Yes |
No |
||
|
Trust in |
Yes |
Uncritical |
Transform / protect S(D) Chapter 4 |
|
No |
Transform / protect D Chapter 3 |
Transform / protect D and S(D) Chapter 4.1.4 |
|
The straight-forward case is when trust exists both towards the service provider and towards the 3rd party. Privacy is not an issue.
| ↓14 |
When there is no third party or the third party can be trusted, the data holder only has to make sure that his confidential data D is protected against potential misuse at the service provider's site. This case usually applies for the use of online services that require the input of confidential data such as income data or personal health data, and is explained in Chapter 3.
Things become more difficult if an untrusted third party comes in. This may be the case in the sketched example of a regional health initiative that collects and analyzes primary care data (this is the service S) and distributes health reports to their community (the 3rd party that is not necessarily trusted). This case is addressed in Chapter 4.
The most delicate case occurs when neither the service provider nor the third party can be trusted. The opportunities to deliver useful service results in this case are very limited. Consider e.g. online voting [Asonov, et al., 2001]. Confidential data (the vote) is passed to the service provider (the state or another official voting institution) who aggregates the votes and publishes the election result to the public (i.e., the 3rd party). A voter would want to keep his vote secret both towards the state and towards the public. We elaborate shortly on this case in Section 4.1.4.
| ↓15 |
The data holder uses a service that is offered online (web-based service) that requires him to send input data to the service provider. The result of the service is returned to the data holder who is the service user at the same time (cf. Figure 2-2). A simple example is the query for specific share values. The name of the share (e.g. 'MERQ' for Mercury Interactive Group) is the input datum D, the result of the service request S(D) is ($42.46 24-Mar-04 3:58pm). Besides this basic kind of database query, there are more complex services such as wage accounting or online health checks. This case is particularly characterized by the absence of a third party, i.e. the data holder is simultaneously the service user and only has to protect his confidential data from the service provider. We assume for this case that the privacy policy of the service provider explicitly rules out the forwarding of customer / user information to a third party.
| Figure 2-2: Confidential data flow in 2-party services | ||
Note that in this case too, both input datum D and service result S(D) have to be protected from an untrusted service provider. Yet we will show in Section 3.3 that a transformation of D is sufficient to protect both D and S(D).
| ↓16 |
Although there are many application areas for the 2-party case, we will motivate our work with two important instances.
The single-user web-based service refers to the well-known case of a single person using popular web-based services e.g. when searching for information (e.g. at www.google.com), buying digital goods (e.g. at www.amazon.com) or managing financial assets (e.g. at www.citibank.com). A concerned user who is very hesitant with sharing confidential information with the service provider may ask the following questions.
| ↓17 |
Although at least some of these tasks sound infeasible we will show in Chapter 3 that for a selected range of applications, we can indeed obtain a desired service result without sharing confidential input information with the service provider.
Opposed to the single-user web-based service, the outsourced web-based service refers to an application service provider (ASP) who offers "software as a service" usually to an entire company. An ASP "deploys, hosts and manages access to a packaged application for its customers from a centrally managed facility" (see [IDC, 1999] for a definition). Contracting an ASP promises its customers to reduce capital investment, to make IT costs more transparent, to facilitate the focus on core competencies and to provide faster access to high-end software applications [Tamm, 2003]. However, this also means that an ASP always hosts the (potentially confidential) business data of the customer with the corresponding implications for data security and privacy. An extensive survey by [Carter, 2000] shows that this innovative kind of software provision is severely inhibited by privacy concerns of ASP customers. We elaborate on these concerns in Section 3.2 and show how this conflict can be resolved for some particular applications.
Database services and arithmetic operations are core components of web-based services. Several approaches have been created to address related privacy problems in database service provider architectures.
| ↓18 |
One promising research direction is Private Information Retrieval (PIR) which was first presented by [Chor, et al., 1995]. It allows clients to query a database server, revealing neither the query nor the result of the query to the server. The model is simpler than that of traditional relational databases [Codd, 1970] because the query consists of an array index and the answer is the contents of the indexed array field. Yet it is powerful enough to implement many different applications such as file systems and dictionaries. If there is only one database server, the proven most efficient algorithm that preserves the secrecy of the query is for the user to download the entire database for each query. There are more efficient algorithms for several servers that do not cooperate in an attack. More recently, a practical method assuming a trusted physical device (a secure coprocessor, see [Smith and Weingart, 1999])in the server host has been proposed by [Asonov and Freytag, 2002]. This approach is almost optimal in resource overhead, and the assumption that the coprocessor on the server site is not compromised is arguably weaker than the assumption that several servers do not cooperate in an attack. PIR-algorithms can be used as a building block in privacy-preserving database outsourcing methods, see [Fischmann and Günther, 2003].
[Hacigumus, et al., 2002; Hacigumus, et al., 2002] present an approach that allows relational database operations on encrypted data. Before encryption, the data is aggregated to partitions on the customer side to decrease the amount of information that the service provider receives. The consequence is that the client needs to do some post-processing, as the provider can compute only an approximation of the result which can also be error-prone. Unfortunately, it is only efficient on a subset of relational algebra. For instance, each range query condition of the form A < x needs to be transformed into a disjunction of conditions matching all concrete values smaller than x before being encrypted. Also, as [Fischmann and Günther, 2003] show, even with aggregation this scheme is not very secure.
[Damiani, et al., 2003] take a different route along the same line of reasoning. Instead of aggregating the data, each plaintext attribute is properly encrypted to a unique ciphertext, and a method is proposed to compute exposure coefficients that tell the data holder how much information he is giving away. Furthermore, it is explained how B-trees [Bayer and McCreight, 1972] can be encrypted to allow for more efficient range queries on encrypted tables. However, even encrypting the B-trees for retrieving ranges of records only helps improve performance with respect to the naive approach, but information on the attribute in question is still leaked. Each time all records are retrieved that satisfy A < x, the service provider learns a set of ciphertexts that represent values of A that are smaller than x.
| ↓19 |
[Song, et al., 2000] have proposed a family of schemes for encrypting a text corpus such that it can be searched without decryption. These methods are efficient and proven secure, and certainly an interesting building block for privacy-preserving application distribution.
As essential theoretical foundations we should mention secure multi-party computation [Goldreich, 1998], a generalization of privacy-preserving data mining and oblivious transfer [Naor and Pinkas, 2001], a more rigid category of protocols related to private information retrieval, although neither is the subject of this work.
We can see that the privacy-preserving use of arithmetic operations and database services have been elaborated in more or less disjoint research fields. However, we believe that both arithmetic and database operations are core components for almost every web-based service and should be analyzed and developed jointly.
| ↓20 |
In Chapter 3 we present a comprehensive analysis of what kind of services are feasible given the security requirements of the data holders. We explore how database and arithmetic operations can be usefully combined to offer securely outsourced services. Obviously, the extent of services that can be conducted on encrypted data is limited and not plentiful enough for arbitrary use. Our aim is to explore the trade-off between "not secure enough" and "not useful enough".
To motivate our work, we show how the confidential data of an ASP customer can be compromised. We propose a service architecture that hides plain data from the service provider and we carry out sample services within this framework. We evaluate our framework with regard to time and memory requirements and discuss practical implementation issues.
In a 3-party service architecture, the service user is not necessarily the data holder. This is the case in the scenario we sketched in the introduction, where a regional health initiative (the service provider) collects and analyzes data from patients (the data holders) to distribute results to medical researchers or, of course, to the patients (the service users). Figure 2-3 displays this.
| ↓21 |
| Figure 2-3: Confidential data flow in 3-party services | ||
The main threat in this scenario is that from the published report (i.e. the service result), confidential information about individuals can be inferred. The problem of obtaining the confidential datum D from the publicly available service result S(D) is the inference problem well-known in the statistical disclosure control (SDC) literature (for comprehensive surveys on SDC, see [Adam and Wortman, 1989; Shoshani, 1982]).
We will now give two examples for the 3-party service case.
| ↓22 |
Census bureaus such as the U.S. Bureau of Census (www.census.gov) collect data and provide statistics to the public in order to characterize regions socially and economically. The main threat for the data holders is the risk of being re-identified in the published statistic, thus divulging confidential information such as income or debt.
Regional health initiatives such as the Pittsburgh Regional Healthcare Initiative (PRHI, www.prhi.org) collect, analyze and disseminate chronic disease data to patients and researchers. Data holders include pharmacies, physicians, health maintenance organizations (HMOs), laboratories and patients. Besides the re-identification threat for patients, the other parties also fear a breach of their privacy. An HMO for instance may fear that internal data ends up in the hands of a competitor and is (mis-)used in marketing campaigns. We will elaborate on this case extensively in Chapter 4.
Data integration deals with the technical integration of heterogeneous data sources. It is necessary for instance, when different HMOs provide data about the same diagnostic test in different formats. [Wiederhold, 1993] introduced the notion of a mediator between data holders and data providers to resolve semantic conflicts, and later added a security component to his basic model [Wiederhold,et al., 1996]. [Rezgui, et al., 2002] proposed a privacy mediator based on the screening of external database queries for sensitive attributes and their eventual removal from processed queries. Complementary to the mediator approach is the data warehouse approach [Chaudhuri and Dayal, 1997; Inmon, 1996], where data from the different sources is extracted, transformed and then loaded into a read-only database. Queries are then no longer run on the multiple databases via the mediator but directly on the data warehouse database. For reasons of simplicity, we will call the intermediary system "mediator" from now on because it best incorporates the idea of negotiating between data holders and service users.
| ↓23 |
Statistical disclosure control (SDC) is concerned with providing access to high-quality statistics for business / policy purposes (i.e., the service S)while at the same time protecting the confidentiality of the individual data providers (i.e. the data holders, for instance survey or Census respondents). SDC distinguishes two principle approaches, query restriction and data perturbation. The query restriction family includes the following.
Query set size control [Fellegi, 1972] works by setting lower and upper bounds for the size of the query answer set based on the properties of the database and on the preferences fixed by the database administrator. If the number of returned records does not lie within these bounds, the information request would have to be rejected and the query answer is denied. As queries that are issued sequentially by one user often have a large number of entities in common, an improvement is the restriction of these entities to a maximum number, see [Dobkin, et al., 1979]. Although popular, this method is not robust enough as a stand-alone solution, see [Denning, 1982].
Auditing involves keeping up-to-date logs of all queries made by each user and constantly checking for possible disclosures whenever a new query is issued. One major drawback of this method is that it requires huge amounts of storage and CPU time to keep these logs updated. A well-known implementation of such an audit system is Audit Expert by [Chin and Özsoyoglu, 1982]. It uses binary matrices to indicate whether or not a record was involved in a query.
| ↓24 |
Cell suppression [Cox, 1980] is an important method for categorical databases when information is published in tabular form. Census Bureaus often make use of tabular data and publish counts of individuals based on different categories. One of the main privacy objectives is to avoid answers of a small size. For example, if a snooper knows somebody's residence, age and employer, he can issue a query for (ZIP=10178, Age= 57, Employer= 'ABC'). If the answer is one entity, the snooper could go on and query for (ZIP= 10178, Age= 57, Employer= 'ABC', Diagnosis= 'Depression'). If the answer is one again, the database is compromised and the person with the diagnosis identified. The cells must be suppressed. A common criterion to decide whether or not to suppress a cell is the N-k rule where a cell is suppressed if the top N respondents contribute at least k% of the cell total. N and k are parameters that are fixed by the database administrator, i.e. the Census Bureau. In the exemplary case of N= 2 and k= 10%, a cell which indicates aggregated income ($10M) of 100 individuals would have to be suppressed if the top two earners’ aggregate income exceeded $1M.
In the query restriction approach, either exact data is delivered from the original database or the query is denied. An alternative is to perturb the original values such that confidential, individual data become useless for a snooper while the statistical properties of the attribute are preserved. The manipulated data is stored in a second database and is then freely accessible for the users.
Data swapping [Denning, 1982] is the process of exchanging attribute values (like income) within a list of data holders such that no assignments from individuals to their income can be made anymore. However, the arithmetic average and the standard deviation of the attribute income stay the same.
| ↓25 |
Noise addition for numerical attributes [Traub, et al., 1984] means adding a disturbing term to each value: Y k = X k +e k , where X k is the original value and e k adheres to a given probability distribution with mean zero. As for every value X k , the perturbation e k is fixed; therefore conducting multiple queries does not refine the snooper's search for confidential single values.
A hybrid approach are random-sample queries [Denning, 1980] where a sample is drawn from the query set in such a way that each entity of the complete set is included in the sample with probability P. If, for example, the sample of a COUNT query has n entities, then the size of the not perturbed query set can be estimated as n/P. If P is large, there should be a set-size restriction to avoid small query sets where all entities are included.
Another prominent approach to rule out re-identification in public databases is k-anonymity [Sweeney, 2001; Sweeney, 2002; Sweeney, 2002]. This is an anonymization procedure after which each individual cannot be distinguished from another (k-1) individuals in the database. Thus every set of attribute values appears at least k times. As k increases, so too does the anonymity in the database.
| ↓26 |
Data Mining, the science of efficiently discovering valuable, non-obvious information from large databases, may well be misused to intrude upon the privacy of organizations and individuals [Clifton and Marks, 1996]. One research direction is the use of cryptographic protocols to calculate aggregates from several contributors without divulging individual information [Canny, 2002; Canny, 2002; Clifton, 2001; Lindell and Pinkas, 2000; Vaidya and Clifton, 2002].
[Agrawal and Srikant, 2000] present a new method for privacy-preserving data mining. Based on a database of perturbed values, they are able to reconstruct the original value distribution. Confidential information of individuals is not compromised.
We propose a new class of privacy mediators that go beyond the state of the art with regard to the privacy protection methods applied to the final service result S(D). We propose extending the common query-rewriting approach by developing and integrating new methods of disclosure control. Our proposals for preventing inferences are not limited to mediator-based approaches and are applicable in the context of data warehousing and statistical databases as well, see Section 4.1.3.
| ↓27 |
To motivate our work, we use a real-world example that shows how a snooping HMO is able to determine narrow bounds on confidential information of its competitors by analyzing the aggregate data published by the mediator. We propose a specific "audit and aggregate" methodology that helps detect and prevent this specific kind of disclosure. Furthermore, we evaluate our method within a framework that measures the trade-off between decreasing risk of privacy breaches on behalf of the data provider and the loss of information for the legitimate service user.
After distinguishing between the two-party and the three-party case in the preceding sections, we will now introduce another dimension that is important in the privacy context. Users of web-based services can provide data either reactively to the service provider or non-reactively / passively (see e.g. [Boyens, et al., 2002]).
Reactive data provision means that data holders explicitly provide their personal data (e.g. for their health data in an online health check). They actively fill in web forms or web questionnaires and submit the information to the service provider.
| ↓28 |
Non-reactive or passive data provision means that data holders do not explicitly provide personal data but generate it automatically e.g. through their behavior. The best-known case for this is the tracking of web-shop customers with the help of cookies, a file stored on the user's hard disk that identifies the customer each time he visits a web site. For a more detailed description of cookies and their role in privacy protection see [EPIC, 2004] or www.cookiecentral.com.
For the case of Internet services, [Teltzrow and Kobsa, 2004] distinguish between "user data" and "usage data", where user data refers to demographic data, user skills and knowledge, user interests and plans whereas usage data refers link selections, viewing behavior and purchase actions. Usually, online service users are much less aware of their passive data provision because it is often triggered by their web browser settings. For an overview of reactive and passive data provision, see Table 2-2.
Table 2-2: Data provision in different services (means of data provision in parenthesis)
|
Data provision |
|||
|
Reactive |
Passive |
||
|
Service |
2-party |
Online health check |
Personalization services |
|
3-party |
Census |
Chronic disease reports |
|
| ↓29 |
Allowing the employment of cookies does not only yield privacy compromises. It also yields benefits in the form of personalized services. These services can include customized finance pages or news collections, customized recommendations or advertisements based on past purchase behavior, customized pricing, express transactions or tailored email alerts. In online book stores for instance, these personalized services include recommendations to recently published books of interest. The recommendations are based on clicking behavior and on preceding book or CD purchases. Unfortunately, many online stores do not offer an option to (de-)activate the tracking of the click and purchase history and thereby prevent the user from easily trading off his own privacy concerns with the potential benefit from an extended service [Kobsa, 2001].
In Table 2-3 and Table 2-4 we give a short and, of course, incomplete list of examples of services with their respective fit into the dimensions number of parties and reactive vs. passive data provision. For each service, we name examples of sensitive data that is typically required as input data, and we name some major threats that these data may be subject to.
Table 2-3: Typical services in 2-party architectures
|
2-party services |
||||
|
Service |
Sample service provider |
Data provision |
Sensitive data di |
Major threats / criticality |
|
Health check |
reactive |
age |
Forwarding of personal health information to 3rd parties |
|
|
Online store |
passive |
hobbies |
"Profiling" of customers, tracking via cookies |
|
|
ASP service |
reactive |
birth date |
Disclosure of personal / corporate secrets (product launch dates) |
|
| ↓30 |
Table 2-4: Typical services in 3-party architectures
|
3 -party services |
||||
|
Service |
Sample service provider |
Data provision |
Sensitive data di |
Major threats / criticality |
|
Census |
reactive |
name |
Re-identification of individuals, disclosure of e.g. income |
|
|
Online voting |
reactive |
election vote |
Disclosure of the confidential vote either to the state or to fellow citizens |
|
|
Health reports |
passive |
address |
Re-identification of individuals / disclosure of e.g. confidential test values |
|
There are several areas of research that are outside the scope of this thesis. The most important one is technical data security. We assume that adequate measures for communication confidentiality (such as the Secure Socket Layer SSL [Netscape, 1996]) and access control (such as Kerberos [Neumann and Ts'o, 1994] or X.509 [ITU, 2000]) are in place. For a detailed discussion of cryptography and network security, see [Schneier, 1996; Stallings, 1999].
Another important area is the extension of privacy legislation. For purposes of this work we assume the validity of the current privacy laws and recommendations in place, in particular [HIPAA, 1996; USPA, 1974] for the USA and [EU, 1995; EU, 2002] for the European Union.
| ↓31 |
A very important area of investigation of an individual's attitude towards privacy [Ackerman,et al., 1999] and the behavior that is derived from this attitude (which often differs significantly from the attitude declared beforehand, see [Spiekermann,et al., 2001]). Our work focuses on technical measures that are not affected by individual privacy behavior.
2 www.w3.org
3 [Jarvenpaa et al., 2000] found that the reputation and the perceived size have a significant impact on the trust in internet stores
| © Die inhaltliche Zusammenstellung und Aufmachung dieser Publikation sowie die elektronische Verarbeitung sind urheberrechtlich geschützt. Jede Verwertung, die nicht ausdrücklich vom Urheberrechtsgesetz zugelassen ist, bedarf der vorherigen Zustimmung. Das gilt insbesondere für die Vervielfältigung, die Bearbeitung und Einspeicherung und Verarbeitung in elektronische Systeme. | ||
| DiML DTD Version 4.0 | Zertifizierter Dokumentenserver der Humboldt-Universität zu Berlin | HTML generated: 22.01.2007 |