
Spring 2018
Final project: PennCloud
Teams must form by March 18, 10:00pm EDT
Project proposal due on March 26, 10:00pm EDT
Work-in-progress demos in the week April 9-13
Final demos in the finals week April 30-May 8
Project code and report due on the evening of your demo (10:00pm EDT)
1 Overview
The final project is to build a small cloud platform, somewhat similar to Google Apps, but obviously with
fewer features. The cloud platform will have a webmail service, analogous to Gmail, as well as a storage
service, analogous to Google Drive.

The figure on the right illustrates the high-level structure of the
system. Users can connect to a set of frontend servers with their
browsers and interact with the services using a simple web
interface. Each frontend server runs a small web server that
contains the logic for the different services; however, it does not
keep any local state. Instead, all state is stored in a set of backend
servers that provide a key-value store abstraction. That way, if one
of the frontend servers crashes, users can simply be redirected to a
different frontend server, and it is easy to launch additional frontend servers if the system becomes
overloaded.

The project should be completed in teams of four. There are several different components that need to
interact properly (this is a true "software system"!), so it is critical that you and your teammates think
carefully about the overall design, and that you define clear interfaces before you begin. In Section 3, we
have included some example questions you may want to discuss with your team. It is also very important
that you work together closely, and that you regularly integrate and test your components – if you build the
components separately and then try to run everything together two hours before your demo, that is a sure
recipe for disaster. To make integration easier, we will provide shared Git repositories for each team (after
all teams have been formed). Please do not use GitHub or some other web repository for this project.

CIS 505: Software Systems

In the specification below, we have described a minimal solution and a complete solution for each
component. The former represents the minimum functionality you will need to get the project to work; we
recommend that you start with this functionality, do some integration testing to make sure that all the
components work together, and only then add the remaining features. The latter represents the functionality
your team would need to get full credit for the project. Finally, in Section 5, we describe some suggestions
for additional features that we would consider to be extra credit. The set of extra-credit features is not fixed,
however; you should also feel free to be creative and add functionality of your own.

The project must be implemented entirely in C or C++. You may not use external components (such as a
third-party web server or key-value store, external libraries, scripting languages, etc.) unless we explicitly
approve them.
2 Major components
2.1 Key-value store
Your system should store all of its user data in a distributed key-value store, somewhat analogous to
Google's Bigtable. (Unless you want to, you do not need to implement its more advanced features; all you
need to know is the interface below.) Conceptually, the storage should appear to applications as a giant
table, with many rows and many columns. The storage system should support at least the following four
operations:
• PUT(r,c,v): Stores a value v in column c of row r
• GET(r,c): Returns the value stored in column c of row r
• CPUT(r,c,v1,v2): Stores value v2 in column c of row r, but only if the current value is v1
• DELETE(r,c): Deletes the value in column c of row r
The table should be sparse, that is, not every row should have to have a value in every column. One way to
implement this could be to store the contents of each row as a set of tuples {(c1,v1), (c2,v2), ...}, where
c1, c2, ... are the columns that have values in that row and v1, v2, ... are the values in these columns.
The row and column names should be strings,
and the values should be (potentially large) binary values; for instance, applications should be able to invoke
PUT("linhphan", "file-8262922", X), where X is a PDF file that user "linhphan" has stored in the storage
service (see below).
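
As a rough illustration of the sparse-table idea, the following sketch keeps each row as a map from column name to value, so rows only pay for the columns they actually use. The class name Table and its method names are hypothetical, not part of the specification, and the sketch is single-threaded and in-memory only.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical in-memory sparse table: a row exists only while it holds
// at least one column.
class Table {
public:
    // PUT(r,c,v): store v in column c of row r, overwriting any old value.
    void put(const std::string& r, const std::string& c, std::string v) {
        rows_[r][c] = std::move(v);
    }

    // GET(r,c): the stored value, or std::nullopt if the cell is empty.
    std::optional<std::string> get(const std::string& r,
                                   const std::string& c) const {
        auto row = rows_.find(r);
        if (row == rows_.end()) return std::nullopt;
        auto cell = row->second.find(c);
        if (cell == row->second.end()) return std::nullopt;
        return cell->second;
    }

    // DELETE(r,c): remove the cell; returns false if it did not exist.
    bool del(const std::string& r, const std::string& c) {
        auto row = rows_.find(r);
        if (row == rows_.end() || row->second.erase(c) == 0) return false;
        if (row->second.empty()) rows_.erase(row);  // drop rows with no columns
        return true;
    }

    std::size_t rowCount() const { return rows_.size(); }

private:
    // row name -> (column name -> value); std::string may contain '\0' bytes,
    // so it can carry binary data as long as the length is passed explicitly.
    std::map<std::string, std::map<std::string, std::string>> rows_;
};
```

Values are held as std::string here purely for brevity; a real server would also need locking and some form of persistence.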

Minimal solution: An initial version of the storage backend could consist of just a single server process
that listens for TCP connections, accepts the four operations defined above (you and your teammates can
define your own protocol), and stores the data locally. You should be able to reuse some of your HW2MS1
code for this.

Full solution: The complete version should be distributed: there should be several storage nodes that each
store some part of the table (perhaps a certain range of rows). You may assume that the set of storage nodes
is fixed; for instance, there could be a configuration file that contains the IPs and port numbers of all storage
nodes, analogous to HW3. It should also replicate the data – that is, each value should be stored on more
than one storage node – and it should offer some useful level of consistency as well as some degree of fault
tolerance – that is, it should avoid losing data when nodes crash, and the data should continue to be
accessible as long as some of the replicas are still alive.

2.2 Frontend server
Your system should also contain at least one web server, so that users can interact with your system using
their web browsers. Your web server should implement a simple subset of the HTTP protocol (RFC 2616).
Below is a simple example of an HTTP session:

C: GET /index.html HTTP/1.1
C: User-Agent: Mozilla
C:
S: HTTP/1.1 200 OK
S: Content-type: text/html
S: Content-length: 47
S:
S: <html><body><h1>Hello world!</h1></body></html>

As you can see, the client issues a request for a particular URL (here: /index.html) and potentially
provides some extra information in header lines (here: information about the user's browser), followed by
an empty line. The server responds with a status code (here: 200 OK, to indicate that the request worked),
potentially some headers of its own, and then the contents of the requested URL.
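
One possible way to pick such a request apart is sketched below. It assumes the complete header section has already been read from the socket into a string; the HttpRequest struct and the parseRequest function are hypothetical names. Header names are lowercased because HTTP header names are case-insensitive.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

struct HttpRequest {
    std::string method, path, version;
    std::map<std::string, std::string> headers;  // header names lowercased
};

// Parses the request line and the header lines up to the first empty line.
// Returns false if the request line does not have three fields.
bool parseRequest(const std::string& raw, HttpRequest& req) {
    std::istringstream in(raw);
    std::string line;
    if (!std::getline(in, line)) return false;
    if (!line.empty() && line.back() == '\r') line.pop_back();
    std::istringstream first(line);
    if (!(first >> req.method >> req.path >> req.version)) return false;

    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) break;                   // empty line ends the headers
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;  // skip malformed header lines
        std::string name = line.substr(0, colon);
        for (char& ch : name) {
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        }
        std::size_t start = line.find_first_not_of(" \t", colon + 1);
        req.headers[name] =
            (start == std::string::npos) ? "" : line.substr(start);
    }
    return true;
}
```

For the session above, this would yield method "GET", path "/index.html", and a "user-agent" header with the value "Mozilla".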

Your server should internally have several handler functions for different kinds of requests. For instance,
one function could produce responses to GET / requests, another for POST /login requests, and so on.
You should take care to avoid duplicating code between the handler functions; for instance, the handlers
could each return the response as an array of bytes, and there could then be some common code that sends
these bytes back to the client.
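
The handler idea could look roughly like the sketch below: handlers are registered per (method, path) pair and return only a body, while one shared function adds the status line and headers. Router, Handler, and the simplified handler signature (no request argument) are assumptions made for this illustration.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical handler type: returns the response body as a byte string.
using Handler = std::function<std::string()>;

class Router {
public:
    void add(const std::string& method, const std::string& path, Handler h) {
        routes_[{method, path}] = std::move(h);
    }

    // Code shared by all handlers: find the handler, then wrap its body in
    // a status line and headers so that no handler repeats this logic.
    std::string respond(const std::string& method,
                        const std::string& path) const {
        auto it = routes_.find({method, path});
        if (it == routes_.end()) return format("404 Not Found", "Not found\n");
        return format("200 OK", it->second());
    }

private:
    static std::string format(const std::string& status,
                              const std::string& body) {
        return "HTTP/1.1 " + status + "\r\n"
               "Content-type: text/html\r\n"
               "Content-length: " + std::to_string(body.size()) +
               "\r\n\r\n" + body;
    }

    std::map<std::pair<std::string, std::string>, Handler> routes_;
};
```

A real handler would also receive the parsed request and could set its own status code and headers; the point here is only that the formatting and sending code lives in one place.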

Importantly, your server should check whether the client includes a cookie with the request headers; if not,
it should create a cookie with a random ID and send it back with the response. This is important so that
your server can distinguish requests from different clients that are logged in concurrently. For more
information about cookies, please see https://www.nczonline.net/blog/2009/05/05/http-cookies-explained/.

Minimal solution: An initial version of the server could be based on the multithreaded server code you
wrote for HW2MS1 (with some adjustments for the different protocol). For a quick introduction to HTTP,
see https://www.jmarshall.com/easy/http/. Initially, you may want to just implement GET requests, as in
the above example; to get something working, you can leave out anything nonessential, such as transfer
encodings, persistent connections, or If-Modified-Since. You can also initially leave out the cookie handling;
however, keep in mind that without this, only one user will be able to use the system at a time.

Full solution: For a fully functional server, you'll need some additional features, including support for
POST requests (for submitting web forms and uploading files to the storage service), as well as HEAD
requests and cookie handling.

2.3 User accounts
Your system should support multiple user accounts. When the user first connects to the frontend server (a
GET / request), the server should respond with a simple web page that contains input fields for a username
and password. Once the form is submitted, the server should check the storage system to see if the password
is correct, and if so, respond with a little menu that contains links to the user's inbox and file folders (and
perhaps to extra-credit features, if your system supports any). If the password is not correct, the server
should respond with an error message.

Minimal solution: To get something to work quickly, you could simply preload a few usernames and
passwords into the key-value store and check these against the credentials that the user enters.

Full solution: The complete solution should also allow users to sign up for a new account, and users should
be able to change their passwords.
2.4 Webmail service
Your system should enable users to view their email inbox, and to send emails to other users, as well as to
email addresses outside the system (e.g., Gmail). When the user opens her inbox, she should see a list of
message headers and arrival times; when she clicks on a message, she should be able to see its contents,
and she should be able to delete the message, write a reply, or forward it to another address. There should
also be a way to write a new message. Note: The focus here is on the functionality and not on making the
service "look pretty" (or Gmail-like), so feel free to use simple HTML elements to display the emails, e.g.,
for a list of email headers or for editing the text of an email.

Minimal solution: To get something to work quickly, you could restrict email transmissions to users within
your system.

Full solution: A complete solution should accept emails from outside your system, i.e., from an SMTP
client (e.g., Thunderbird) within your VM or your host machine. (Accepting emails from remote users on
other machines is much harder and requires control over a DNS entry, so this is not required.) For this, you
can adapt the SMTP server from HW2 so that it puts incoming emails into the storage system instead of an
mbox file. It should also be possible to send emails to remote users outside your system (e.g., Gmail or
SEAS email accounts); for this, you'll need to add a simple SMTP client for sending emails. The SMTP
client should use the DNS to look up the MX records for the recipient's domain, and connect to one of the
servers that are specified in these records. Please keep in mind that modern SMTP servers have a variety of
anti-spam measures built in (such as greeting delays and temporary rejections); if your client does not work
with external servers but works with your own SMTP server, you may want to have a look at
https://en.wikipedia.org/wiki/Anti-spam_techniques.
2.5 Storage service
Users should have access to a simple web storage service, similar to Dropbox or Google Drive. They should
be able to upload files into the system (which would then be stored somewhere in the key-value store), they
should be able to download files from their own storage, and they should be able to see a list of the files
that are currently in their account. Notice that this is intended as a simple storage service and not as a Google
Docs clone; you need to support uploads and downloads, but not creation or editing.

Minimal solution: Initially, you could just implement a flat name space without folders. Users could upload
their files with an HTML form that contains an <input type="file"> element; downloads could simply
be done using HTTP GET. It is okay to impose a maximum file size (e.g., a few MBs) so that each file fits
into a single key-value pair.

Full solution: Your final solution should also have a way to delete files, to create and delete folders, to
rename files and folders, and to move files or folders from one folder to another.
2.6 Admin console
Your system should also contain a special web page that shows some information about the system. The
page should be accessible through some special URL (say, http://localhost:8000/admin). At
the very least, this page should show the nodes in the system (frontend servers and backend servers) and
their current status (alive or down), and it should provide a way to view the raw data in the storage service,
e.g., by showing a table of key-value pairs (maybe ten at a time, with prev/next buttons). It should also
provide a way to disable individual storage nodes, e.g., using a button, so you can test what happens when
a node fails (which is useful in testing fault-tolerance). Depending on which features your team implements,
you may want to add other things to this page; for instance, if you implement recovery, you may want to
add a button that can be used to restart disabled individual storage nodes, so you can test what happens
when a node recovers. It is okay to implement additional methods (besides PUT/GET/...) in your storage
system to support the admin console; for instance, a function for listing row keys may turn out to be useful.
3 Implementation notes
This section contains some tips and suggestions; these are not part of the specification and are simply meant
to make your job easier. Feel free to implement your system differently.
3.1 Organizing the communication
You will probably find that the components in your system have to communicate with each other frequently.
For instance, if the user wants to view the contents of her storage folder, her browser would send a request
to one of the frontend servers, which in turn would have to send some 'GET' or 'PUT' requests to the storage
nodes. The details of how the frontend server does this are up to you; however, one simple approach would
be to have a 'server loop' in each storage node that opens a TCP port and listens for incoming connections,
just like the SMTP and POP3 servers did in HW2. When the frontend server wants to look up some
key-value pairs, it would open connections to the storage node(s) it needs to talk to, and send its requests over
these connections - perhaps some kind of string. For instance, if the frontend server wanted to delete a
key-value pair, it could send something like DELETE row123 key456. (The details of the protocol are up
to you!) The storage servers could then parse the requests and send responses over the same connection -
again, just like the SMTP and POP3 servers did in HW2.
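
If you go with a line-based text protocol like the DELETE example, the parsing side on the storage node might look like this sketch. The Command struct and parseCommand are hypothetical, and only the two operations that carry no value are handled; since values are binary, PUT would probably need something extra, such as a length field followed by that many raw bytes.

```cpp
#include <sstream>
#include <string>

struct Command {
    std::string op, row, col;
};

// Parses one request line of the form "OP row col". Rejects unknown
// operations as well as lines with missing or extra fields.
bool parseCommand(const std::string& line, Command& cmd) {
    std::istringstream in(line);
    std::string extra;
    if (!(in >> cmd.op >> cmd.row >> cmd.col)) return false;  // too few fields
    if (in >> extra) return false;                            // too many fields
    return cmd.op == "GET" || cmd.op == "DELETE";
}
```

Note that this simple format would not work for row or column names that contain spaces; a length-prefixed encoding avoids that restriction.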

You do not need to worry about fancy authentication schemes or encryption; if you add these, it would be
considered extra credit. Also, you could consider using third-party serialization frameworks, such as
Google's protobuf/gRPC, Apache Thrift, or Boost; however, please remember to ask for explicit permission
on Piazza before you use any third-party libraries or third-party code.

3.2 Load balancing and fault tolerance in the front-end
Recall from the first page that the frontend is supposed to be replicated across multiple machines, for load
balancing and fault tolerance. This raises the question of how clients would pick the machine that they want
to connect to - users generally won't know the IP addresses of the frontend machines or how many of them
there are, and they certainly won't want to manually type these addresses into their browser. Many data
centers contain a network-level load balancer component for this that transparently redirects each new
connection to one of the frontend servers. This is a little beyond the scope of this project, but you can
approximate this in various ways; for instance, you could build a tiny special-purpose "web server" that
accepts the initial HTTP request from new clients and simply responds with a temporary redirect to one of
the real frontend servers. Thus, clients would only need to know the address of this special web server (and
this would also be the address that would be stored in the DNS). The special web server could keep track
of which frontend servers are "alive" at any given moment and/or how busy these servers currently are, and
redirect new requests to one of the "live" servers, perhaps even the least busy one. Notice that this special
"web server" would only be involved in the first request from each client; after the redirect, the client would
send further requests to the chosen frontend server directly.
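
The core of such a redirecting "web server" could be as small as the sketch below, which picks the least busy live frontend and answers with a 307 Temporary Redirect. The Frontend struct, its fields, and the example addresses are assumptions; how the liveness and load information gets collected (for example, through periodic heartbeats) is left open.

```cpp
#include <string>
#include <vector>

// Hypothetical record the redirector keeps for each frontend server.
struct Frontend {
    std::string address;     // host:port, e.g. "localhost:8001"
    bool alive;
    int activeConnections;   // rough measure of how busy the server is
};

// Returns a temporary redirect to the live frontend with the fewest active
// connections, or a 503 response if no frontend is alive.
std::string redirectResponse(const std::vector<Frontend>& servers,
                             const std::string& path) {
    const Frontend* best = nullptr;
    for (const Frontend& s : servers) {
        if (!s.alive) continue;
        if (!best || s.activeConnections < best->activeConnections) best = &s;
    }
    if (!best) {
        return "HTTP/1.1 503 Service Unavailable\r\nContent-length: 0\r\n\r\n";
    }
    return "HTTP/1.1 307 Temporary Redirect\r\n"
           "Location: http://" + best->address + path + "\r\n"
           "Content-length: 0\r\n\r\n";
}
```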

To achieve good fault tolerance, it is probably a good idea not to keep 'hard' state on the frontend servers.
If you store all the state (user accounts, files, emails, ...) in the key-value store, the failure of a frontend
server should not affect clients very much: they can simply connect to the site again and be redirected to a
different frontend server; in this case, all of their data would simply be loaded from the key-value store
again. Having said this, you may want to cache key-value pairs on the frontend servers for a short amount
of time in order to improve performance. You could use the conditional put primitive (CPUT) to prevent
inconsistencies: for instance, you could include a version number in important key-value pairs and,
whenever a frontend server needs to change a key-value pair that is in its local cache, it could issue a CPUT
with the cached value and the new value. If the CPUT fails, another frontend server has modified the same
key-value pair.
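
To make the CPUT idea concrete, here is a minimal sketch of the check on the storage side, together with a version-tagged value. The cput function, the Cell alias, and the "v1:100" value format are made up for this illustration; in a multithreaded storage server the comparison and the write would have to happen under the same lock.

```cpp
#include <map>
#include <string>
#include <utility>

using Cell = std::pair<std::string, std::string>;  // (row, column)

// Conditional put: stores newValue only if the cell currently holds exactly
// the value the caller expects. Returns false if the cell is missing or has
// been changed by someone else in the meantime.
bool cput(std::map<Cell, std::string>& store, const Cell& cell,
          const std::string& expected, const std::string& newValue) {
    auto it = store.find(cell);
    if (it == store.end() || it->second != expected) return false;
    it->second = newValue;
    return true;
}
```

For example, if two frontend servers both cached the value "v1:100" and both try to replace it, only the first CPUT succeeds; the second one fails and that server knows it has to re-read the cell before retrying.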
3.3 Page rendering and session management
You do not need to do anything fancy to produce the HTML pages your frontend servers send to the clients;
you could simply write some basic HTML to an internal buffer, roughly as follows:

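#include <stdio.h>
#include <string.h>

/* Appends formatted text to buffer without writing past bufferSizeBytes.
   Assumes buffer already holds a NUL-terminated string (possibly empty). */
#define append(...) do { size_t used = strlen(buffer); \
    snprintf(buffer + used, (size_t)bufferSizeBytes - used, __VA_ARGS__); } while (0)

void renderLoginPage(char *buffer, int bufferSizeBytes)
{
  ...
  /* Illustrative markup only; use whatever HTML your login page needs. */
  append("<html><head><title>Login</title></head>\n");
  append("<body><h1>Login</h1>\n");
  append("<form method=\"post\" action=\"/login\">\n");
  append("</form></body></html>\n");
  ...
}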

Then you could write the buffer back to the client over the TCP connection, just like you sent back emails
in your POP3 server from HW2.

You could use cookies to identify different clients. To send a cookie to a client, include a Set-Cookie:
header in your HTTP response (Example: Set-Cookie: sid=123). This will cause the browser to
store the key-value pair (here, sid=123) in a local file, and it will include the key-value pair in all
subsequent requests to the same server (as a Cookie: header, e.g., Cookie: sid=123). One way to
use this is to associate requests with clients. Suppose, for instance, that client A sends an HTTP GET request
to ask for the login form, and is given a sid=123 cookie as sketched above. Later, client A sends an HTTP
POST with her username and password, say linhphan/secret. Since the cookie is included in that
request, the frontend server can remember that the client with the sid=123 cookie has logged in as
linhphan. If this client now sends an HTTP GET for the email inbox page, the server will recognize that
this is linhphan, and it will return her emails; if a different client sends the same request, it will have a
different cookie, or no cookie at all, so it will be shown a different inbox, or be redirected to the login page
first. (Obviously, for this to be secure, the cookies have to be random and hard to guess.) For more
information about cookies, please see RFC 6265.
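
The two cookie-related steps, reading the sid from a Cookie: header and minting a fresh random ID, could be sketched as follows. The function names are hypothetical. The sketch draws its randomness from std::random_device, which is non-deterministic on common platforms, although the C++ standard does not strictly guarantee that; treat this as an assumption to verify on your VM.

```cpp
#include <cstddef>
#include <random>
#include <string>

// Extracts the value of cookie `name` from the value of a Cookie: header,
// e.g. "theme=dark; sid=123". Returns "" if the cookie is not present.
std::string cookieValue(const std::string& header, const std::string& name) {
    std::size_t pos = 0;
    while (pos < header.size()) {
        std::size_t end = header.find(';', pos);
        if (end == std::string::npos) end = header.size();
        while (pos < end && header[pos] == ' ') ++pos;  // skip leading spaces
        std::size_t eq = header.find('=', pos);
        if (eq != std::string::npos && eq < end &&
            header.compare(pos, eq - pos, name) == 0) {
            return header.substr(eq + 1, end - eq - 1);
        }
        pos = end + 1;  // move on to the next name=value pair
    }
    return "";
}

// Generates a 32-character hexadecimal session ID from four random words.
std::string newSessionId() {
    static const char hex[] = "0123456789abcdef";
    std::random_device rd;
    std::string id;
    for (int i = 0; i < 4; ++i) {
        unsigned int word = rd();
        for (int j = 0; j < 8; ++j) {
            id += hex[word & 0xF];
            word >>= 4;
        }
    }
    return id;
}
```

A frontend server would call cookieValue on each request and, if the result is empty, attach a Set-Cookie: header with a value from newSessionId to its response.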
3.4 Partitioning the storage
One way to divide up the key-value pairs between the storage nodes is to define ranges of row keys, similar
to BigTable's 'tablets', and to assign each range to a specific storage node, or to a set of storage nodes. For
instance, suppose the row keys are alphanumeric; then you could have one tablet for row keys that start
with a-d, one for keys that start with e-h, etc. The master could keep the mapping from ranges to storage
nodes, and it could give the mapping to clients upon request, who could then send their GET and PUT
requests to the relevant storage node(s) directly. The tablets should be small enough to allow good
load-balancing (if you have one huge tablet, one poor storage server has to do all the work!), but they should not
be too small, either (if you have one tablet per row key, lots of bookkeeping will be required). The ranges
could even be dynamic; for instance, you could start with a few big tablets and then 'split' tablets at runtime
once they become too large.
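
A range-to-node mapping of this kind can be kept in an ordered map keyed by the first row key of each tablet, as in the sketch below. TabletMap, nodeFor, and the node addresses are hypothetical, and each tablet is mapped to a single node for simplicity; with replication the mapped value would be a list of nodes.

```cpp
#include <iterator>
#include <map>
#include <string>

// Maps the first row key of each tablet to the node that stores it. A tablet
// covers all keys from its start key up to (not including) the next start key.
using TabletMap = std::map<std::string, std::string>;

// Returns the storage node responsible for rowKey, or "" if rowKey sorts
// before the first tablet.
std::string nodeFor(const TabletMap& tablets, const std::string& rowKey) {
    auto it = tablets.upper_bound(rowKey);  // first tablet starting after rowKey
    if (it == tablets.begin()) return "";
    return std::prev(it)->second;
}
```

Splitting a tablet at runtime then amounts to inserting one more start key into the map, which is one reason an ordered map is convenient here.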

If your design contains a master node, please avoid putting it on the 'critical path', e.g., by involving it in
every single GET or PUT operation, or potentially even sending all the data through it. It is fine to have the
master do some coordination (e.g., remember which row ranges are stored where, keep track of which
storage nodes are currently alive, trigger re-replication of failed tablet copies), but all the 'heavy lifting'
should be done directly between the clients and the storage nodes.
3.5 Consistency and fault tolerance
If your storage servers are multithreaded, you'll need to use locks to prevent inconsistencies if multiple
clients are issuing PUTs and GETs in parallel. There are many possible locking schemes – for instance,
locks could be associated with tablets, rows, or even individual cells. Your team should think about the
tradeoffs we have discussed in class, and then make a decision.
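
As one point in that design space, the sketch below shows row-level locking: a small registry hands out one mutex per row, so operations on different rows do not block each other. RowLocks and forRow are hypothetical names, and the sketch never removes mutexes for deleted rows, which a long-running server would need to address.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

class RowLocks {
public:
    // Returns the mutex that guards the given row, creating it on first use.
    // The same row always maps to the same mutex.
    std::shared_ptr<std::mutex> forRow(const std::string& row) {
        std::lock_guard<std::mutex> guard(tableLock_);
        std::shared_ptr<std::mutex>& slot = locks_[row];
        if (!slot) slot = std::make_shared<std::mutex>();
        return slot;
    }

private:
    std::mutex tableLock_;  // protects the map itself, held only briefly
    std::map<std::string, std::shared_ptr<std::mutex>> locks_;
};
```

A worker thread would fetch the mutex with forRow and hold it in a std::lock_guard for the duration of a PUT, GET, CPUT, or DELETE on that row. Tablet-level locks need less bookkeeping but allow less parallelism; cell-level locks go the other way.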