# Big Data and related discussion

Week 2 – Big Data and related discussion

Imagine the following scenario:

You (i.e. you personally) are tasked with calculating the average of a set of numbers. Can you do it? I expect you would answer yes. But what if you had to calculate this average from a set of 1000 raw numbers, and you had to do it by hand, with no calculators or other electronic means? You could still do it, but it would take a very long time. Is there a better way?

A better way would be to distribute the work. You round up 99 other students, so that each of the 100 of you now calculates, by hand, the average of a set of only 10 numbers. So we’ve taken the 1000-number dataset and given each of the 100 students 10 numbers to work with. Now the task is relatively easy, no?

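And when each of you comes up with the average of your 10 numbers, we can collectively take the average of those 100 averages, and that is our final answer. (This works because every student has the same number of values; with uneven groups, each average would have to be weighted by the size of its group.) And we’re done.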
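If you would like to check the average-of-averages idea for yourself, here is a minimal Python sketch. The 1000 numbers are made up (random integers with a fixed seed), and the variable names are mine, purely for illustration.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the run is repeatable
numbers = [random.randint(1, 100) for _ in range(1000)]  # made-up data

# Hand 10 numbers to each of 100 "students".
chunks = [numbers[i:i + 10] for i in range(0, len(numbers), 10)]

# Each student computes the average of their own 10 numbers.
student_averages = [fmean(chunk) for chunk in chunks]

# Combine: take the average of the 100 averages.
distributed_answer = fmean(student_averages)

# The slow way: one person averages all 1000 numbers.
direct_answer = fmean(numbers)

print(len(chunks), distributed_answer, direct_answer)
```

The two answers should agree, apart from tiny floating-point rounding differences, because every chunk holds exactly 10 numbers.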

Several things to note: the long computation done by one person would take a long time to finish. That’s roughly how a traditional Relational Database Management System (RDBMS) running on a single server would do it. Traditional database systems largely work through the data in sequence, and that can take a long time if there is a lot of data. In contrast to an RDBMS, a BigData environment will distribute the task.

Here’s how it’s different using a BigData setup: the 100 students are analogous to 100 computing nodes, each node an independent computer working on a piece of the problem. The distribution of work and the collection of intermediate averages are like the MapReduce function you’ve seen in the reading. In MapReduce, work is mapped out to the nodes, and the partial results are then reduced back to a single answer.

Note also that the 100 students can work in parallel, i.e. independently and simultaneously. That’s analogous to MPP, or massively parallel processing, and that’s why the answers can come back much faster.

Furthermore, if one computing node (student) crashes (falls ill), the work need not stop: that node’s piece can be handed to another node, so a node going down is not really a big deal. That’s analogous to computing nodes being considered commodity components, i.e. cost-effective hardware (cheap labor).
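To make the map-and-reduce idea concrete, here is a small Python sketch under some assumptions of my own. The function names `map_chunk`, `combine`, and `distributed_mean` are hypothetical, and a thread pool on one machine stands in for the independent nodes; a real BigData cluster would run the map step on separate computers. Each "node" reports a (sum, count) pair rather than an average, so the final answer stays correct even when the chunks are uneven.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_chunk(chunk):
    """Map step: one node summarizes its chunk as (sum, count)."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Reduce step: merge two partial (sum, count) results."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(numbers, n_nodes=100):
    # Deal the numbers out round-robin, like dealing cards to students.
    chunks = [numbers[i::n_nodes] for i in range(n_nodes)]
    chunks = [c for c in chunks if c]  # skip nodes that received nothing

    # Stand-in for the cluster: worker threads process the chunks.
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(map_chunk, chunks))

    # Fold all the partial results into one (sum, count) pair.
    total, count = reduce(combine, partials)
    return total / count

data = list(range(1, 1004))  # 1003 numbers, so the chunks are uneven
print(distributed_mean(data))  # 502.0
```

Because each chunk is processed without looking at any other chunk, the map calls can run in any order, or be rerun for a chunk whose node failed, without changing the final answer.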

If you keep this analogy in mind, about how a BigData environment differs from traditional computation, you’ve got the key concept of how it works, why it’s fast, and why it can be cost-effective.

For the discussion this week, think about how the “3 V’s” of BigData (volume, velocity, and variety) and their related concepts would apply to the problem-scenario that you used in part 1 of this week’s HW.

How might a BigData approach be used to help match students with job opportunities, and vice versa? Discuss . . .