EECS 489 Lab 3: ImageDB with Bloom Filter

This assignment is due on Fri, 5 Feb 2016, 6 pm. Support code won't be released until after the PA1 due date.

Introduction

In terms of the number of lines you need to write, this lab is very short: about 10 lines of code, two for Task 1 and eight for Task 2. The amount of time needed depends on how comfortable you are with modular arithmetic and bitwise operations.

We assume a client-server setup in this lab. The server (imgdb) will eventually be our distributed hash table (DHT) node, but in this lab, we assume there is only one such node.

 % imgdb [ -b <beginID> -e <endID> ]

Upon start up, the server loads its database with images from an "images" folder under the current working directory (where you run the server from). For each image, the server computes a SHA1 value from which an ID is derived. Only those images whose IDs fall within the server's ID range are loaded into its database. When an image is loaded into the database, it is also entered into a Bloom filter by computing three indices from the same SHA1 value.

The function that loads the database and populates the Bloom filter is provided in the support code as imgdb::loaddb(). Study this function carefully to see how to generate a SHA1 value from an image name, and how to generate an ID and populate the Bloom filter from the SHA1 value. You can also review the lecture on Bloom filter (p. 7, slide #28).

By default, the range of the server is (0, 0], i.e., the full identifier ring. (In math, a parenthesis indicates that the range does not include the value specified (open), and a bracket indicates that it does (closed). In this case, the range starts just above zero and runs all the way round the identifier ring back to zero: the start of the range doesn't include zero, but the end does.) You can use the -b and -e command line options to set the start and end values of the server's ID range.

The client (netimg) is exactly the client in Lab1. The full client code is provided as part of the support code. You don't have to write any client code. If you have written your own client code for PA1, you may use it instead, though your imgdb implementation must interoperate with the provided netimg.

Assumptions

We make some assumptions for this and subsequent labs and for PA2.

We assume an object ID size of 8 bits. To compute an object ID, we "fold up" a 160-bit SHA1 value into 8 bits, so the probability of IDs colliding becomes much higher. For the images, once we have a hit on the Bloom filter, we simply do a linear search of the database. A match requires matching both the image's ID and name, which also resolves any hashing collision (false positive) for us.
We assume a fixed maximum size, IMGDB_MAXDBSIZE, of the image database. Once this capacity is reached, we simply print out a message to inform the user that we're not adding more images, but the server continues to run otherwise.
We assume that once loaded, images are never removed. So we don't have to worry about holes in the database or resetting the Bloom filter.
We assume that only one image is read into memory at any one time. Each time there is a search hit, the image will be read from file.
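
To make the "fold up" in the first assumption concrete, here is a minimal sketch of one possible way to reduce a 160-bit digest to 8 bits, by XOR-ing its 20 bytes together. The helper name fold_to_id and the XOR scheme are assumptions for illustration only; the support code's imgdb::loaddb() shows the folding this lab actually uses, and it may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A SHA1 digest is 160 bits, i.e. 20 bytes.
constexpr std::size_t kSha1Len = 20;

// Hypothetical helper (not part of the support code): XOR-fold a
// 20-byte digest into one 8-bit ID. Every digest byte influences the
// result, but 2^160 digests now share only 256 IDs, so collisions
// are expected.
inline std::uint8_t fold_to_id(const std::array<std::uint8_t, kSha1Len>& md) {
    std::uint8_t id = 0;
    for (std::uint8_t byte : md) {
        id ^= byte;
    }
    return id;
}
```

With only 256 possible IDs, two different image names can easily fold to the same ID, which is why a match must compare the name as well as the ID.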

Task 1: Circular ID and Bloom Filter

Your first task is to write the function ID_inrange(ID, begin, end) in hash.cpp. Given an ID, return true (1) if ID is in the range (begin, end] modulo HASH_IDMAX+1, defined in hash.h. For example, 147 is in the range (138, 150] but not in the range (150, 200], whereas 210 is in the range (200, 10]. This function is used by imgdb::loaddb(), so you can observe its working by modifying the server's ID range (using -b and -e command line options) and watching which images in the database are loaded.
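
The circular range test can be sketched as follows. This is only an illustration of the modular arithmetic involved, under the assumption that IDs are 8 bits wide so that HASH_IDMAX+1 is 256; the name in_ring_range is hypothetical and is not the signature expected in hash.cpp.

```cpp
#include <cstdint>

// Assumption: the identifier ring has 256 positions (8-bit IDs).
// Hypothetical sketch; not the support code's ID_inrange().
inline bool in_ring_range(std::uint8_t id, std::uint8_t begin, std::uint8_t end) {
    // Rotate the ring so that begin+1 lands on 0. The operands are
    // promoted to int for the subtraction; casting the result back to
    // uint8_t reduces it modulo 256.
    const auto offset = static_cast<std::uint8_t>(id - begin - 1);
    const auto span = static_cast<std::uint8_t>(end - begin - 1);
    // id lies in (begin, end] exactly when its rotated position does
    // not pass the rotated position of end. When begin == end, span is
    // 255 and every id is accepted (the full ring).
    return offset <= span;
}
```

Checked against the examples above: in_ring_range(147, 138, 150) and in_ring_range(210, 200, 10) are true, in_ring_range(147, 150, 200) is false, and with the default range (0, 0] every ID is accepted.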

Next, populate the Bloom Filter, imgdb::bf. In imgdb::loadimg(), every time an image name is added to the database, compute three Bloom Filter indices (locations) using the function hash.cpp:bfIDX(). We use the three constants/macros BFIDX1, BFIDX2, and BFIDX3 to compute the three indices. These are defined in hash.h. Once an index is computed, use it to set the corresponding bit on the Bloom Filter. See the inline comments in imgdb::loadimg() for further instructions.

Each of the above takes one line of code.
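
For intuition, setting three bits of a Bloom filter from one digest might look like the sketch below. Everything in it is an assumption made for illustration: the 64-bit filter width, the probe offsets 0, 7 and 13, and the helper names bloom_index and bloom_add. The real bfIDX(), BFIDX1, BFIDX2 and BFIDX3 are defined in the support code and may work differently.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDigestLen = 20;  // SHA1 digest size in bytes
constexpr unsigned kFilterBits = 64;    // assumed filter width

// Assumed stand-ins for BFIDX1..BFIDX3: which digest byte feeds each index.
constexpr std::array<std::size_t, 3> kProbeOffsets = {0, 7, 13};

// Hypothetical index function: take one digest byte and reduce it
// modulo the filter width to get a bit position.
inline unsigned bloom_index(const std::array<std::uint8_t, kDigestLen>& md,
                            std::size_t offset) {
    return md[offset % kDigestLen] % kFilterBits;
}

// Enter one digest into the filter by setting its three bits.
inline void bloom_add(std::uint64_t& filter,
                      const std::array<std::uint8_t, kDigestLen>& md) {
    for (std::size_t offset : kProbeOffsets) {
        filter |= std::uint64_t{1} << bloom_index(md, offset);
    }
}
```

Bits are only ever set, never cleared, which matches the assumption that images are never removed once loaded.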

Task 2: Image Database Search

Your second task is to complete the imgdb::searchdb() function, to check the Bloom filter for the presence of an image in the database. The function imgdb::searchdb() is called from imgdb::handleqry(). To call imgdb::searchdb(), you first compute the SHA1 message digest of the given imgname. From the computed SHA1 value, you compute the ID of the image. You want to thoroughly understand how imgdb::loaddb() works before attempting to complete this task, which should take about 6 lines of code: 2 lines in imgdb::handleqry(), the remainder in imgdb::searchdb(), both in the file imgdb.cpp.
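
The overall shape of such a search, Bloom filter check first and linear scan second, can be sketched with a toy database. The types and names below (TinyImageDb, ImageEntry, SearchResult) and the 64-bit filter are hypothetical and do not mirror the layout of the imgdb class. The sketch takes the three bit positions as an argument so that it stays independent of how they are derived from the SHA1 value.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class SearchResult { BloomMiss, FalsePositive, Found };

struct ImageEntry {
    std::uint8_t id;
    std::string name;
};

// Hypothetical toy database with an assumed 64-bit Bloom filter.
struct TinyImageDb {
    std::uint64_t bloom = 0;
    std::vector<ImageEntry> entries;

    void add(std::uint8_t id, const std::string& name,
             const std::array<unsigned, 3>& bits) {
        for (unsigned b : bits) {
            bloom |= std::uint64_t{1} << (b % 64);
        }
        entries.push_back({id, name});
    }

    SearchResult search(std::uint8_t id, const std::string& name,
                        const std::array<unsigned, 3>& bits) const {
        // Step 1: if any of the three bits is clear, the image was
        // never added, so the database scan can be skipped.
        for (unsigned b : bits) {
            if (((bloom >> (b % 64)) & 1u) == 0) {
                return SearchResult::BloomMiss;
            }
        }
        // Step 2: the filter can report false positives, so confirm
        // with a linear search that matches both ID and name.
        for (const ImageEntry& e : entries) {
            if (e.id == id && e.name == name) {
                return SearchResult::Found;
            }
        }
        return SearchResult::FalsePositive;
    }
};
```

A query whose three bits all happen to be set by other images passes step 1 and is rejected only in step 2, which is the false-positive case described under Assumptions.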

Support Code

Since the Lab3 support code contains solutions to parts of PA1, it will be made available only to those who have turned in PA1. To those who have turned in PA1, the support code will be available as lab3.tgz in the Course Folder by Friday, 1/29, after PA1's due time. At that time, we will download your PA1 submission from your EECS 489 CTools Drop Box for grading. If you want to submit your PA1 late, do not put any PA1 submission file in your Drop Box by the due date. If there is a PA1 file in your Drop Box by the due date, we will assume you will not be doing any late submission and we won't be grading any later submission, without exception. You will not have access to the Lab3 support code until you have turned in your PA1. If you've decided not to turn in PA1, please email the course instructor and you will be given access.

You can also find the reference implementation refimgdb and an images folder in /afs/umich.edu/class/eecs489/w16/lab3. If you'd like to download the images to your own computer, you can grab images.tgz (about 25 MB). As usual, refimgdb was compiled on CAEN's Red Hat 7, so don't try to run it on your Mac OS X or Windows machines. Recall that the complete source code for netimg is included in the support code, so you should be able to build the client on your local platform. The support code has been built and tested on Linux, Mac OS X, and Windows. If you're not using the provided Makefile, note that imgdb.cpp must be compiled with the compiler option -DLAB3 for the main() function to be included.

On Ubuntu and Windows, you need to install the OpenSSL library to build imgdb. On Ubuntu, assuming you have sudo privileges, do:

sudo apt-get install libssl-dev

On Windows, please refer to the section of the Building Socket Program course note for links and instructions to install and use the OpenSSL library. You'll also need to add the compiler flag /DLAB3 to your project's properties. If you don't know how to do this, follow the instructions in the course note.

Testing Your Code

Run imgdb without any command line option. Run netimg to connect to the running imgdb and request ShipatSea.tga. The image should be served and displayed. Now run:

% imgdb -b 220 -e 20

You should see "*in range*" printed next to the name of each image whose ID is within your imgdb's ID range.

Next, run netimg to connect to the running imgdb and request ShipatSea.tga. Assuming the ID you compute for ShipatSea.tga is outside the (220, 20] range, you should get the message

imgdb: ShipatSea.tga: Bloom filter miss.

on the server side and

netimg: ShipatSea.tga image not found.

on the client side. Test for other boundary conditions.

Submission Instructions

As with Lab 1, incorporating publicly available code in your solution and passing off the implementation of one algorithm as that of another are both considered cheating in this course. If you cannot implement a required algorithm, you must inform the teaching staff when turning in your assignment.

Your submission must compile and run without errors on CAEN eecs489 hosts using the provided Makefile, unmodified, without any additional libraries or compiler options.

Your "Lab3 files" comprise your hash.cpp and imgdb.cpp files.

To turn in your Lab3, upload a zipped or gzipped tarball of your Lab3 files to the CTools Drop Box. Keep your own backup copy! The timestamp on your uploaded file is your time of submission. If this is past the deadline, your submission will be considered late. You are allowed multiple "submissions" without late-policy implications as long as you respect the deadline. We highly recommend that you use a private third-party repository such as GitHub, M+Box, Dropbox, or Google Drive to keep the backup copy of your submission. Local timestamps can be easily altered and cannot be used to establish your files' last modification times (-10 points). Be careful to use only a third-party repository that allows for private access. Putting your code in a publicly accessible third-party repository is an Honor Code violation.

Turn in ONLY the files you have modified. Do not turn in support code we provided that you haven't modified (-4 points). Do not turn in any binary files (object, executable, dll, library, or image files) with your assignment (-4 points). Your code must not require other additional libraries or header files other than the ones listed in the Makefile (-10 points).

Do remove all printf()'s or cout's and cerr's and any other logging statements you've added for debugging purposes. You should debug using a debugger, not with printf()'s. If we can't understand the output of your code, you will get zero points.