PASMAC: Privacy and Security for MapReduce Clouds

Despite the hype, only a few large enterprises or governmental organizations outsource their services to public clouds. On the one hand, cloud computing has been identified as one of the top ten business strategies as it offers many advantages such as greater flexibility and reduced costs. However, on the other hand, outsourcing and, therewith, relinquishing control of services implies many security and privacy problems. Cloud providers often place their data centers in foreign countries where security policies are difficult to enforce, cloud infrastructures are threatened by hackers, insiders, and even malicious customers trying to peek into or tamper with outsourced data. Consequently, the cloud cannot be trusted. As of today, lack of security and privacy guarantees are major adoption obstacles for both, large enterprises and governmental organizations.

Our research, PASMAC, targets the design and evaluation of protocols for secure and privacy- preserving “data analysis” in an untrusted cloud. With PASMAC, the user can store and query data in the cloud, preserving privacy and integrity of outsourced data and queries. PASMAC specifically addresses a real-world cloud framework: Google’s prominent MapReduce paradigm. PASMAC will design and prototype new protocols based on highly parallelizable, efficient privacy-preserving techniques, such as efficient private information retrieval, encrypted Bloom filters, and additive homomorphic encryption.

Research in PASMAC is supported by NSF grant 1218197.

PIRMAP

Efficient Private Information Retrieval for MapReduce

One often overlooked problem with outsourcing data to the cloud is privacy of access patterns. There are many exisiting solutions for encrypting outsourced data, but it is usually accepted that the cloud will be able to see when, where and how often you access it. It is often difficult to quantify the information this access pattern leaks to an interested party. For instance, a cloud storing medical data may not be able to read individual patient records, but it could infer that a person has cancer by the fact that an oncologist is accessing his/her records.

Existing techniques that mitigate this problem (oblivious RAM and private information retrieval protocols) currently require too much computation or bandwidth to be usable in cloud situations. PIR techniques can have low bandwidth requirements, but must inherently perform some computation over the entire database for each query. ORAM, in contrast, has low computational cost but must occasionally transfer large portions of the database back to the client to be processed. Since outsourcing data to the cloud only makes sense if you have a lot of it, these techniques are not currently practical for the large database sizes that would occur in such a situation.

Our approach to this problem is to take advantage of the large computational resources that are available in cloud settings to make fast PIR queries possible for large sets of data. We have designed a version of PIR that can run efficiently on MapReduce and scale to a large number of compute nodes. This allows the client to decide how fast they would like the query to return and assign resources accordingly.

See also: Travis Mayberry, Erik-Oliver Blass, Agnes Hui Chan, “PIRMAP: Efficient Private Information Retrieval for MapReduce”, Proceedings of Financial Cryptography and Data Security (FC’13), pp. 371—385, Okinawa, Japan, 2013, ISBN 978-3-642-39883-4 Paper
MapReduce Implementation

PRISM

Privacy-Preserving Search in MapReduce

We present PRISM, a privacy-preserving scheme for word search in cloud computing. In the face of a curious cloud provider, the main challenge is to design a scheme that achieves privacy while preserving the efﬁciency of cloud computing. Solutions from related research, like encrypted keyword search or Private Information Retrieval (PIR), fall short of meeting real-world cloud requirements and are impractical.

PRISM’s idea is to transform the problem of word search into a set of parallel instances of PIR on small datasets. Each PIR instance on a small dataset is efﬁciently solved by a node in the cloud during the “Map” phase of MapReduce. Outcomes of map computations are then aggregated during the “Reduce” phase. Due to the linearity of PRISM, the simple aggregation of map results yields the ﬁnal output of the word search operation. We have implemented PRISM on Hadoop MapReduce and evaluated its efﬁciency using real-world DNS logs.

PRISM’s overhead over non-private search is only 11%. Thus, PRISM offers privacy-preserving search that meets cloud computing efficiency requirements. Moreover, PRISM is compatible with standard MapReduce, not requiring any change to the interface or infrastructure.

See also: Erik-Oliver Blass, Roberto Di Pietro, Refik Molva, Melek Onen, “PRISM – Privacy-Preserving Search in MapReduce”, Proceedings of Privacy Enhancing Technologies Symposium (PETS’12), pp. 180—200, Vigo, Spain, 2012, ISBN 978-3-642-31679-1 Paper
MapReduce Implementation

EPiC

Efficient Privacy-Preserving Counting for MapReduce

In the face of an untrusted cloud infrastructure, outsourced data needs to be protected. Fully homomorphic encryption is one solution that also allows performing operations on outsourced data. However, the involved high overhead of today's fully homomorphic encryption techniques outweigh cloud cost saving advantages, rendering it impractical. We present EPiC, a practical, efficient protocol for the privacy-preserving evaluation of a fundamental operation on data sets: frequency counting. In an IND-CPA encrypted outsourced data set, a cloud user can specify a pattern, and the cloud will count the number of occurrences of this pattern in a completely oblivious manner. A pattern is expressed as a boolean formula on the fields of the records and can specify values counting, range counting, and conjunctions/disjunctions of field values.

EPiC's main idea is, first, to reduce the problem of counting to a summation of polynomial evaluations. Second, to efficiently evaluate the summation of polynomial evaluations in a privacy-preserving manner, we extend previous work on the Hidden Modular Group Order assumption and design a new somewhat homomorphic encryption scheme. We show how a general pattern, defined by a boolean formula, is arithmetized into a multivariate polynomial over GF(2) and used in EPiC. This scheme is highly efficient in our particular counting scenario.

Besides a formal analysis where we prove EPiC's privacy, we also present implementation and evaluation results. We specifically target Google's prominent MapReduce paradigm as offered by major cloud providers. Our evaluation performed both locally and in Amazon's public cloud with data sets sizes of up to 1 TByte shows only modest overhead compared to non-private counting, attesting to EPiC's efficiency.

See also: Erik-Oliver Blass, Guevara Noubir, Triet Vo Huu, “EPiC: Efficient Privacy-Preserving Counting for MapReduce”, Cryptology ePrint Archive: Report 2012/452, 2012 Paper
MapReduce Implementation

Path-PIR

Efficient Private File Retrieval by Combining ORAM and PIR

Recent research results on tree-based Oblivious RAM by Shi et al. [14] obtain communication complexity of O(l · log3(N)) in the worst-case for an N-capacity storage with blocks size l. The individual nodes in the tree, however, are constructed using traditional ORAMs which have worst-case communication complexity linear in their capacity and block size. PIR protocols are able to provide better worst-case bounds (decoupling capacity from block size), but have traditionally been less practical than ORAM due to the fact that they require O(N) computational complexity on the server. This paper presents Path-PIR, a hybrid ORAM construction, using techniques from PIR, that overcomes the individual weaknesses of each. Path-PIR significantly reduces communication complexity when the block size of the ORAM is large. Compared to existing work, this leads to smaller data transfer costs by orders of magnitude for practical sized databases and achieves worst-case communication complexity of O(l · log2 (N)) for large block sizes. Additionally, the typically high computational cost of PIR is negated by the tree structure of the ORAM, which requires only a small fraction of the database to be operated on for each query. We also investigate the concept of an ORAM’s latency, which is the amount of communication required before users receive the result of their query. We show that Path-PIR achieves lower latency than any existing scheme, only about four times the block size. Using Amazon EC2 as an example, we demonstrate that even with the additional cost of PIR computation, Path-PIR provides a significant monetary saving compared to related work.

See also: Travis Mayberry, Erik-Oliver Blass, Agnes Chan, “Efficient Private File Retrieval by Combining ORAM and PIR”, Proceedings of 20th Annual Network & Distributed System Security Symposium (NDSS ‘14), pp. 1–11, San Diego, USA, 2014, Distinguished Paper Award Paper
Amazon Implementation

RASP

Practical Forward-Secure Range and Sort Queries with Update-Oblivious Linked Lists

We present RASP, a new protocol for privacy-preserving range search and sort queries on encrypted data in the face of an untrusted data store. The contribution of RASP over related work is twofold: first, RASP improves privacy guarantees by ensuring that after a query for range [a,b] any new record added to the data store is indistinguishable from random, even if the new record falls within range [a,b]. Second, RASP is highly practical, abstaining from expensive asymmetric cryptography and bilinear pairings. Instead, RASP only relies on hash and block cipher operations. The main idea of RASP is to build upon a new update-oblivious bucket-based data structure. We allow for data to be added to buckets without leaking into which bucket it has been added. As long as a bucket is not explicitly queried, the data store does not learn anything about bucket contents. Furthermore, no information is leaked about data additions following a query. Besides formally proving RASP's privacy, we also present a practical evaluation of RASP using Amazon Dynamo.

See also: Erik-Oliver Blass, Travis Mayberry, Guevara Noubir, “Practical Privacy-Preserving Range and Sort Queries with Update-Oblivious Linked Lists”, Cryptology ePrint Archive: Report 2013/715, 2013 Paper

HiVE

Toward Robust Hidden Volumes using Write-Only Oblivious RAM

HiVE resides on its own homepage.

Resizable ORAM

Resizable Tree-Based Oblivious RAM

Although newly proposed, tree-based Oblivious RAM schemes are drastically more efficient than older techniques, they come with a significant drawback: an inherent dependence on a fixed-size database. Yet, a flexible storage is vital for real-world use of Oblivious RAM since one of its most promising deployment scenarios is for cloud storage, where scalability and elasticity are crucial. We revisit the original construction by Shi et al. [ASIACRYPT '11] and propose several ways to support both increasing and decreasing the ORAM’s size with sublinear communication. We show that increasing the capacity can be accomplished by adding leaf nodes to the tree, but that it must be done carefully in order to preserve the probabilistic integrity of data structures. We also provide new, tighter bounds for the size of interior and leaf nodes in the scheme, saving bandwidth and storage over previous constructions. Finally, we define an oblivious pruning technique for removing leaf nodes and decreasing the size of the tree. We show that this pruning method is both secure and efficient

See also: Tarik Moataz, Travis Mayberry, Erik-Oliver Blass, Agnes Hui Chan: Resizable Tree-Based Oblivious RAM. In Financial Cryptography and Data Security (FC’15) Paper

r-ORAM

Recursive Trees for Practical ORAM

We present a new, general data structure that reduces the communication cost of recent tree-based ORAMs. Contrary to ORAM trees with constant height and path lengths, our new construction r-ORAM allows for trees with varying shorter path length. accessing an element in the ORAM tree results in different communication costs depending on the location of the element. The main idea behind r-ORAM is a recursive ORAM tree structure, where nodes in the tree are roots of other trees. While this approach results in a worst- case access cost (tree height) at most as any recent tree- based ORAM, we show that the average cost saving is around 35% for recent binary tree ORAMs. Besides reducing communication cost, r-ORAM also reduces storage overhead on the server by 4% to 20% depending on the ORAM’s client memory type. To prove r-ORAM’s soundness, we conduct a detailed overflow analysis. r- ORAM’s recursive approach is general in that it can be applied to all recent tree ORAMs, both constant and poly-log client memory ORAMs. Finally, we implement and benchmark r-ORAM in a practical setting to back up our theoretical claims.

See also: Tarik Moataz, Erik-Oliver Blass, Guevara Noubir “r-ORAM: Recursive Trees for Practical ORAM”, In 15th Privacy Enhancing Technologies Symposium (PETS 2015) (to appear) Paper
Python Implementation

PASMAC

Privacy and Security for MapReduce Clouds

People

PhD Students

Faculty