COMP3008 BIG DATA ANALYTICS

20 CREDIT MODULE
ASSESSMENT: 100% Coursework
W1: 30% Set Exercises
W2: 70% Report
MODULE LEADER: Dr Marco Palomino
marco.palomino@plymouth.ac.uk
DEADLINE
Set Exercises: Tuesday 20 April 2021 at 16:00
Report: Thursday 13 May 2021 at 16:00
MODULE AIMS:
1. To introduce students to the fundamentals of non-relational (NoSQL) databases.
2. To critically evaluate the differences between relational and NoSQL databases.
3. To gain experience pre-processing data files adequately for the use of NoSQL databases.
4. To gain experience with NoSQL databases through hands-on projects.
ASSESSED LEARNING OUTCOMES (ALO):
ALO1: Critically compare and contrast the differences between relational and non-relational databases.
ALO2: Critically appraise non-relational database strengths and weaknesses.
ALO3: Demonstrate the ability to perform all the CRUD operations (namely, create, retrieve, update and delete) on non-relational databases.
1. SET EXERCISES
Police.UK has published data provided by the 43 geographic police forces in England and Wales, the British Transport Police, and the Police Service of Northern Ireland. In the first part of your coursework, you will be working with the data created specifically by Essex Police in 2018, reporting on street-level crime incidents and their outcomes.
The data supplied by Police.UK is available on the DLE (look for the Coursework section). The data is split into 24 CSV files: 12 reporting incidents (one per month) and 12 reporting the outcomes of those incidents (one per month). The fields available in the CSV files are defined below (note that some of these fields refer to incidents, and others refer to outcomes):
Field | Meaning
Crime ID | A unique identifier for each incident in the dataset and its corresponding outcome (or outcomes).
Reported by | The force that provided the data about the crime.
Falls within | At present, this is also the force that provided the data about the crime, but it is likely to change in the future.
Longitude and Latitude | The anonymised coordinates of the crime. Information on location anonymisation is available at: https://data.police.uk/about/#location-anonymisation
LSOA code and LSOA name | References to the Lower Layer Super Output Area that the anonymised point falls into, according to the LSOA boundaries provided by the Office for National Statistics.
Crime type | One of the crime types listed in the Police.UK FAQ.
Outcome type | A reference to the outcome associated with the crime incident. For example, ‘Formal action is not in the public interest’. Over time, a single incident may have several outcomes.
You should carry out the following exercises.
1.1 Exercise 1 (15 marks)
Create a Neo4j database to store the data contained in the CSV files. The database should respect the data model displayed in Figure 1. You must provide all the commands needed to create the database and populate it with the data in the CSV files, in the exact order in which you propose to execute them. If you create indexes, you must also include the commands for index creation. Your database will be recreated by running the commands you provide, in the order in which you provide them; there is no other way to recreate it.
Figure 1. Data Model (Essex Police – 2018)
Find below an example showing how your commands should be listed (the commands shown below are not the actual answer, but they are meant to illustrate the explanation):
// Create an incident
CREATE (:Incident {crime_id:…
// Create an outcome
CREATE (:Outcome {crime_id:…
// Create an index
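
For readers unfamiliar with bulk loading, the sketch below shows the general shape such a command listing might take. It is only an illustration under stated assumptions, not the expected answer: the label Incident and the property crime_id follow the listing above, while the file name, the Month column, the other property names and the index name are hypothetical. The data model in Figure 1 takes precedence over anything shown here, and the index syntax assumes a recent Neo4j 4.x release.

```cypher
// Load one monthly incident file (hypothetical file name placed in Neo4j's import directory).
// Column names containing spaces must be quoted with backticks.
LOAD CSV WITH HEADERS FROM 'file:///2018-01-essex-street.csv' AS row
CREATE (:Incident {
  crime_id:    row.`Crime ID`,     // property name taken from the listing above
  month:       row.Month,          // assumed column; check the CSV header
  reported_by: row.`Reported by`   // hypothetical property name
});

// Create an index on the identifier so later lookups by crime_id are fast.
// The index name incident_crime_id is hypothetical.
CREATE INDEX incident_crime_id IF NOT EXISTS
FOR (i:Incident) ON (i.crime_id);
```

Listing each statement with a comment, in execution order, mirrors the format requested above.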

1.2 Exercise 2 (5 marks)
Produce a Neo4j query to list all the crime types contained in the database followed by the number of incidents corresponding to each crime type. The crime type with the highest number of incidents should appear at the top, the crime type with the second highest number of incidents should appear next, and so on. Your answer must show the query followed by the result. You can find below a sample answer (the query and results shown below are only meant to illustrate the explanation).
Query:
MATCH (i:Incident), (c:Crime_Type)
RETURN …
Result:
Anti-social behaviour 55
Burglary 43
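
As a hedged illustration of the "count and rank" pattern described in this exercise, the sketch below groups nodes by a property and orders the groups by size. It assumes, purely for illustration, that the crime type is stored as a property named crime_type on Incident nodes; Figure 1 may model crime types differently, in which case the MATCH clause would change.

```cypher
// Hypothetical schema: crime type held as a property on each Incident node.
MATCH (i:Incident)
// Non-aggregated return items act as the grouping key in Cypher.
RETURN i.crime_type AS crime_type, count(*) AS incidents
// Largest group first.
ORDER BY incidents DESC;
```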

1.3 Exercise 3 (5 marks)
Produce a Neo4j query to identify the number of crime incidents reported by Essex Police per month in 2018. The query should list the month followed by the number of crime incidents corresponding to that month (the month with the highest number of incidents should appear at the top of the list, the month with the second highest number of incidents should appear next, and so on). You should submit the query and the result. Your answer should look similar to the following:
Query:
MATCH (i:`Incidents per Month`)
RETURN…
Result:
2018-03 105
2018-05 99
2018-12 78

1.4 Exercise 4 (5 marks)
Produce a Neo4j query to identify the ten most common locations in the database. The query should list each location followed by the number of crime incidents corresponding to that location (the location with the highest number of incidents should appear at the top of the list, the location with the second highest number of incidents should appear next, and so on). You should submit the query and the result. Your answer should look similar to the following:
Query:
MATCH (l:locations)
RETURN …
Result:
On or near Dolphin Gardens 55
On or near Ricketts Drive 45
On or near Tensing Gardens 18
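
A "top ten" listing of this kind is usually obtained by ranking grouped counts and truncating the result. The sketch below illustrates that idea only; the label Location, the property name and the relationship type OCCURRED_AT are hypothetical and must be replaced by whatever Figure 1 prescribes.

```cypher
// Hypothetical schema: each Incident is linked to one Location node.
MATCH (i:Incident)-[:OCCURRED_AT]->(l:Location)
RETURN l.name AS location, count(i) AS incidents
ORDER BY incidents DESC
// Keep only the ten largest groups.
LIMIT 10;
```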

The answers to the four exercises listed above should be submitted via the DLE by 16:00 on 20 April 2021. These exercises represent 30% of your final mark. Note that these exercises must be submitted in a plain text file (TXT file extension).
2. REPORT (RECOMMENDATION ALGORITHMS)
Last.fm is a music website which employs a music recommender system. Last.fm builds a detailed profile of each user’s musical taste by recording details of the tracks the user listens to on their own computer or portable devices. This information is transferred to Last.fm’s database, and it is then used to create recommendations for individual users.
Let us assume that you are working for a consultancy company called Plymouth IT Consultants, which has been hired by Last.fm to redesign its recommender system. As Chief Data Officer at Plymouth IT Consultants, you have been asked to propose a new recommendation algorithm based, entirely, on graph databases. Using the material introduced in COMP3008, you will write a 3000-word report discussing your proposed recommendation algorithm. Essentially, you have to explain how the algorithm will work and justify why this is the best way to do it.
You may propose a combination of algorithms, either taught in COMP3008 or researched on your own. While this is a promising alternative, you will have to clarify exactly how you intend to combine the algorithms: which one you will apply first, how much “importance” will be assigned to each algorithm, and what the actual advantage of the combination is.
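
One common way to make the “importance” of each algorithm explicit is a weighted sum of normalised scores. The formula below is only an illustrative sketch with hypothetical notation, not a required approach: s_k(u, t) denotes the score that algorithm k assigns to track t for user u, scaled to the range [0, 1], and w_k is the weight given to algorithm k.

```latex
\[
  \mathrm{score}(u, t) = \sum_{k=1}^{K} w_k \, s_k(u, t),
  \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
\]
```

For example, with two algorithms weighted 0.7 and 0.3 that score a track 0.8 and 0.4 respectively, the combined score is 0.7 × 0.8 + 0.3 × 0.4 = 0.68.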
Note that your report should focus, exclusively, on the Last.fm recommendation algorithm. Evidently, Last.fm has different databases to deal with the different aspects of the business. For example, Last.fm has a database to record customer subscriptions and payments, and a database for its employees’ payroll. However, you have not been asked to work with such databases. You should only consider what is relevant to create recommendations. Make sure your submission states the step-by-step sequence of actions that you will follow to implement your proposed algorithm. If your algorithm requires the application of mathematical formulae, state such formulae clearly, so that we can understand how your proposal can be implemented.
3. REPORT REQUIREMENTS
The report should fulfil the following requirements.
1. You do not need to present textbook or COMP3008 lecture material describing databases, and little or no credit will be given to such material. You should write your report so that it is understandable and accessible to your colleagues at Plymouth IT Consultants, who you can assume are familiar with the COMP3008 material and do not wish to see it reiterated in your report.
2. Your report should provide an overview of the application of graph databases to recommendation systems. This should also cover the benefits of using a graph database in this area, but do not let these benefits become the focus of your report.
3. Your report should provide details of the application of graph databases to the recommendation system you are proposing for Last.fm:
− You will need to explain what your nodes will represent and what your relationships will be. You will have to provide a graph data model—make sure it is complete and covers all node types and relationships.
− Explain how data relationships can be adequately exploited to match a client’s preferences or friends’ recommendations.
− Consider how data relationships can be used to make recommendations like “your friends also listened to this song”. Make sure that your discussion is based upon evidence from the literature, and steer clear of speculation.
4. Your work on this assignment should involve considerable reading and research from multiple appropriate sources—the Internet is a good place to start, but do not stop there; consider also books and academic articles. Include a reference list and optionally a bibliography, and make sure that your report contains citations to the articles in your reference list.
5. When citing sources, ensure that you provide clear and explicit information about the contents of the cited source, so that the relationship between the contents of the source and the points being made is clear.
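
To make the “your friends also listened to this song” idea from requirement 3 concrete, the sketch below shows how such a recommendation could be expressed as a graph traversal. It is a minimal illustration under assumed names, not a proposed solution: the labels User and Track, the relationship types FRIEND_OF and LISTENED_TO, the properties name and title, and the parameter $userName are all hypothetical.

```cypher
// Find tracks that the user's friends listened to but the user has not.
MATCH (me:User {name: $userName})-[:FRIEND_OF]-(friend:User)-[:LISTENED_TO]->(t:Track)
WHERE NOT (me)-[:LISTENED_TO]->(t)
// Rank candidate tracks by how many distinct friends listened to them.
RETURN t.title AS track, count(DISTINCT friend) AS friends_who_listened
ORDER BY friends_who_listened DESC
LIMIT 10;
```

The traversal follows relationships directly rather than joining tables, which is the property of graph databases that the report is asked to discuss.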
4. REPORT SPECIFICATION
1. The answers to the four exercises specified in Section 1 must be provided in a separate document. These answers amount to 30% of your final mark.
2. The maximum length of the report is 3000 words. Include a word count immediately after your report title; a penalty will apply for omitting it. Reports exceeding the maximum length will be penalised. Your reference list, bibliography and any appendices are not included in the word count.
3. Your report should be entirely your own work.
5. REPORT STRUCTURE
I recommend that you follow this outline:
1. Introduction (200 words).
2. Overview of the use of graph databases in recommendation systems (1000 words).
3. Data model suggested (600 words).
4. Proposed recommendation algorithm (1000 words).
5. Conclusions (200 words).
6. Reference list and bibliography.
Consider the following points before starting your write-up:
1. You will be writing a report, not a journal or conference paper—hence, there is no need for an Abstract. You are not writing a book either. Thus, there is no need for Aims and Objectives; Methodology; Foreword; Table of Contents.
2. Given that your report is intended for dissemination in a professional business environment, the quality of English should be suitable for this purpose—and marks will be deducted for reports that do not meet this requirement:
− Please avoid conversational English.
− Please avoid overly long sentences: 30 words is a long sentence and needs strong use of proper punctuation to be intelligible. I would suggest that you reconsider sentences once they go over 20 words.
− Ensure that your writing is clear, concise and business-like. When writing lacks clarity, everything that you are trying to say becomes obscure: keep it simple, readable and clear. Conciseness helps the reader to understand the intended message; it will also help you to meet the word limit.
− Ensure that your report has a clear structure in terms of numbered sections, sub-sections (if necessary) and paragraphs.
The assignment will be introduced in class to clarify what is expected and how you can obtain formative feedback prior to submission. Whilst the assessment information is provided in Week 3 of the module, you are not necessarily expected to start immediately, as you may not yet have sufficient understanding of the topic. The module leader will provide guidance in this respect.
6. ASSESSMENT CRITERIA
The four exercises listed in Section 1 represent 30% of your final mark. Note that these four exercises must be submitted via the DLE as a plain text file (TXT file extension). The deadline is 16:00 on 20 April 2021.
Exercise | Weighting
1.1 Create a Neo4j database for Essex Police | 15 marks
1.2 Crime types and their corresponding number of incidents | 5 marks
1.3 Crime incidents per month | 5 marks
1.4 Crime incidents per location | 5 marks
The remaining marks in this module will be awarded for the report, which represents 70% of your final mark. The report must be submitted as a Microsoft Word file (DOC or DOCX file extension) by 16:00 on 13 May 2021, and it should fulfil the following criteria.
Criteria | Weighting
1. Is there a properly constructed reference list and are the elements of such a list cited appropriately in the report? | 5 marks
2. Is the research carried out appropriate in terms of breadth and depth? Is the relation between the contents of the cited sources and the contents of the report clearly explained? | 10 marks
3. Does the report provide an overview of the specific application of graph databases to recommendation systems (together with the benefits of such), and is this clearly and explicitly based upon evidence and citations from the literature? | 10 marks
4. Does the report provide a graph data model which adequately describes the data and relationships? Is the model supported by appropriate evidence, examples or citations from the literature? | 20 marks
5. Does the report propose a sensible algorithm? Is the algorithm applicable to a music recommendation system? Does the report identify reasons for its applicability? Is this clearly and explicitly supported by evidence, examples or citations from the literature? Are such examples presented and discussed in sufficient depth? | 20 marks
6. Is there a set of conclusions, and do they provide a reasonable summary, at an appropriate level of abstraction, based upon the contents of the report? | 5 marks
7. GRADE CRITERIA
When awarding marks for individual criteria, I shall employ the following guidelines.
Mark | Criteria
0-49% | The quality of the work has not met the learning outcomes. The understanding and application of fundamental concepts and techniques is questionable. Work of this quality would not be acceptable in professional employment.
50-59% | The quality of the work has only met the threshold level and requires further work to reach a better standard. The submission contains logical or analytical errors related to analysis and design techniques. It demonstrates only a basic understanding of the subject. Further improvement is required to demonstrate personal thoroughness, effort and independent learning.
60-69% | The quality of the work submitted suggests that you are able to apply the analysis and design techniques well. The work you have submitted is substantially correct and complete. It demonstrates a good understanding of the subject, as well as personal thoroughness, effort and independent learning.
70% or more | The quality of the work is outstanding with no significant flaws. It demonstrates a high level of subject knowledge and competence; personal thoroughness, effort and independent learning; and possibly significant additional analytical/critical thought. Well done!
8. GENERAL GUIDANCE
Extenuating Circumstances
There may be a time during this module when you experience a serious situation which has a significant impact on your ability to complete the assessments. Extenuating circumstances are defined in the University Policy on Extenuating Circumstances here:
https://www.plymouth.ac.uk/uploads/production/document/path/15/15317/Extenuating_Circumstances_Policy_and_Procedures.pdf
Plagiarism
All of your work must be in your own words. You must reference your sources, however you acquire them. Where you wish to use quotations, these must be a very minor part of your overall work.
To copy another person’s work is viewed as plagiarism and is not allowed. Any issues of plagiarism and any form of academic dishonesty are treated very seriously. All your work must be your own, and material from other sources must be identified as theirs, not yours. Copying another person’s work could result in a penalty being invoked.
Further information on plagiarism policy can be found here:
Plagiarism:
https://www.plymouth.ac.uk/student-life/your-studies/essential-information/regulations/plagiarism
Examination Offences:
https://www.plymouth.ac.uk/student-life/your-studies/essential-information/exams/exam-rules-and-regulations/examination-offences
Turnitin (http://www.turnitinuk.com/) is an Internet-based ‘originality checking tool’ which allows documents to be compared with content on the Internet, in journals and in an archive of previously submitted works. It can help to detect unintentional or deliberate plagiarism.
It is a formative tool that makes it easy for students to review their citations and referencing as an aid to learning good academic practice. Turnitin produces an ‘originality report’ to help guide you. To learn more about Turnitin go to:
https://guides.turnitin.com/01_Manuals_and_Guides/Student/Student_User_Manual
Referencing
The University of Plymouth Library has produced an online referencing support guide, which is available here:
http://plymouth.libguides.com/referencing
Another recommended referencing resource is Cite Them Right Online, which provides specific guidance on how to reference many different types of material.
The Learn Higher Network has also provided a number of documents to support students with referencing:
References and Bibliographies Booklet:
http://www.learnhigher.ac.uk/writing-for-university/referencing/references-and-bibliographies-booklet/
Checking your assignments’ references:
http://www.learnhigher.ac.uk/writing-for-university/academic-writing/checking-your-assigments-references/

