The Big-O-Notation - Part #2Filed Under: Weekly Tuesday Dose of goodness
- The Big-O-Notation - Part #1
- The Big-O-Notation - Part #2
- The Big-O-Notation - Part #3
Hi all, after such a long time, I’ve finally managed to eek out some time to make this post.
Today, we’ll continue from where we left off previously. Before we begin, I’ll like to reiterate the importance of knowing the Big-O-Notation. This measurement helps measure the worst case performance of an algorithm.
It applies to all algorithms, not just data structures. We talk about it here for 2 reasons.
1) It’s easier to understand data structures since most of us use it in one ways or another
2) Selection of the right data structure is critical to the success of an application development.
Therefore let’s begin (or continue) …
One more thing to note, the data structures we’re talking about refers to the simplest and direct implementation for easy understanding. Our discussion here should have relevance albeit not 100% on the performance measurement. (STL implementations have certain little tricks to make things more optimized than usual)
Just one assumption though, we’ll always assume that binary trees will automatically balance itself regardless of the order of insertion.
Removal
Let’s continue with the performance of removal operations.
Arrays - O ( n ) where n is the size of the array
Single-Linked List - O ( n ) where n is the element size of the linked list
Binary Tree - O (log2 n) where n is the power of 2 based on total number of elements in the binary tree
Why?
Arrays - In some cases, some array implementations do not perform a shift back whenever an element is removed. But since we’re looking for the worst case, in which, most arrays DO perform a shift back, that is moving all the elements from the index after deletion back by 1, the performance is therefore O ( n ). The very worst case occurs when the first element of the array is being removed.
Single-Linked List - The performance hit comes from the need to iterate to the affected node in order to perform the deletion and patching of links. The iteration costs is known as O ( n ). Since it’s O ( n ) for every single removal call.
Binary Tree - The performance hit comes from the need to traverse down the tree as well as a possible variable delay for the traversal comparison itself. A > B or B < A may not necessarily be O ( c ) all the time. Considering that < and > operators may perform complex or linear comparisons as well (such as String comparison). Since we do not that comparison costs into account, it’ll still remain as O ( log 2 N ) for removal.
Iteration
A foreword about iterations. One thing to be understood is that the iteration that we’re talking here goes through all the elements in the data structure.
So why are we talking about it then? This is so that we highlight the differences between a direct access in a for-loop versus iterating the data structure itself.
An iterator is basically a representation of the data structure (whether it’s part of the data structure or IS the data structure is immaterial). It contains highly optimized algorithm (which is actually very simple) to assist in retrieving every single element in the data structure.
How is it different from a simple for-loop?
Let’s have a look.
Arrays:
int myArray[100];
//Isn't this a form of iterator?
for(int index = 0; index < 100; index++)
myArray[index] = rand()%100; //iteration using direct access per access
This is an acceptable iterator for arrays. However, it holds true only if the access speed is fast. In this case, the access speed for Arrays is O ( c ). This is still ideal for iteration even if it’s a container-based Array. Let’s take a look at the std’s implementation:
std::vector<int> myArray(100); //100 specify the size
for(int index = 0; index < 100; index++)
myArray[index] = rand()%100; //iteration using direct access per access
and StridesLib’s implementation:
cArray<int> myArray(100);
for(int index = 0; index < 100; index++)
myArray[index] = rand()%100; //iteration using direct access per access
Linked List:
For linked list, I’ve seen codes like this before: (assume that I’m using StridesLib’s linked list)
cLinkedList<int> list;
list.add(100);
list.add(200);
list.add(300);
list.add(400);
list.add(500);
for(int index = 0; index < list.getSize(); index++)
printf("List element : %d\n", list.get(index));
So what’s wrong with this piece of code? Now, we know that the speed of accessing a single linked list is O ( n ). So many iterations are there in the code above? Let’s work it out:
| Loop Index | Iterations incurred |
| 0 | pHead - 0 |
| 1 | pHead->pNext - 1 |
| 2 | pHead->pNext->pNext - 2 |
| 3 | pHead->pNext->pNext->pNext - 3 |
| 4 | pHead->pNext->pNext->pNext->pNext - 4 |
| Total: | 0+1+2+3+4 = 10 |
So let’s do a forecast graph for linked list access using get():
| Loop Size | Total Iterations incurred |
| 1 | 0 |
| 2 | 1 |
| 3 | 3 |
| 4 | 6 |
| 5 | 10 |
| 6 | 15 |
| 7 | 21 |
| 8 | 28 |
| 9 | 36 |
| 10 | 45 |
| n | O(n) |
Let’s take a look at how the linked list iterator works.
| Loop Index | Iterations incurred |
| 0 | at pHead - 0 |
| 1 | at pHead->pNext - 1 |
| 2 | at pNext->pNext - 1 |
| 3 | at pNext->pNext - 1 |
| 4 | at pNext->pNext - 1 |
| Total: | 4 |
How do we understand the table above? It’s very simple. The keyword is at node. A linked list iterator floats on top of a known node, starts with the head, followed by its next. Every subsequent next is merely an assignment of the next neighbour node, requiring no iteration at all.
So how does an iterator works (if it’s well-implemented) - StridesLib’s variant:
cLinkedList<int> integerList;
integerList.add(100);
integerList.add(256);
integerList.add(384);
integerList.add(128);
integerList.add(512);
cIterator<int> it = integerList.getIterator();
while(it.hasNext())
{
int number = it.getCurrentObject();
printf("number is %d\n", number);
it.next();
}
The output would therefore be:
number is 100 number is 256 number is 384 number is 128 number is 512
The same output can be achieved with iterators and loops using direct access with varied results. Let’s compare them:
For-loop, linked list direct access - 45 iterations for 10 elements
Iteration - 10 iterations for 10 elements!
Do note that no doubt iterators are meant to iterate their respective data structures, it’s only meant to reduce the amount of latency for iteration operations to its possible minimum. On the other hand, a for-loop linked list direct access is considered to be grossly inefficient.
Linked List iterators also have another distinct advantage. That is, imagine a list of projectiles which you want to process and not access directly, but iteratively and has no limits. (Which is why linked list was chosen). Imagine that you’ve a need to remove a node in the middle of iteration, what do you do? Call a direct remove? Or remove while iterating?
The natural answer is of course : Remove while iterating!
The main reason is because, since the iterator cursor is at the desired element, it’s able to perform removal and patch links automatically with ease. Since it’s at the desired element, no further iterations are required thus making it ideal for on-the-fly iteration removal.
Of course, this feature must be implemented by the data structure and iterators (STL does it don’t worry ;p)
Binary Tree:
Finally, we’re at the last leg of iteration with binary trees. One thing to note, iteration with linked lists and arrays may be straightforward, it’s however so not the case for binary tree iterations. Why?
Binary trees do not have a direct path of linear traversal, unlike the previous two data structures.
Therefore, there’re 3 different ways we can iterate a binary tree.
1) Pre-Order Traversal
2) In-Order Traversal
3) Post-Order Traversal
While Pre and Post Order traversals are used for various reasons, the most common form of traversal is In-Order Traversal due to its natural ability to provide a sorted list during iteration.
How is it so? Binary trees naturally, by default, sort out its own data during insertion. That’s how we define who goes to the left branch and right branch respectively. Search times are therefore always reduced to O log base 2 (n). Since the traversal latency comes only from vertical traversal (down the branches) rather than the nodes.
However, it’s not all as rosy as it sounds. There’s a price to pay for sorting out the nodes afterall. That is, as you start moving down-leftwards from the root node, you’ll need to remember the parent(s) nodes that were stored in order to remember where you’ve visited but not processed yet.
To do so, it needs a First-In-Last-Out (FILO) algorithm and guess what? You’ll need a Stack! This stack can be limited or unlimited. Limited because the maximum ever your binary tree can store is 2^32, which at worst case requires a stack with a capacity of 32 elements only. (This is also unlikely, since 2^32 will yield 4.2 billion objects!)
Stack operations by default are at O( c ) per call for push, pop and peek. However, the combined Big-O weight of all these operations makes binary tree iteration much more heavier than linked list and arrays.
Moreover, the rule of thumb for In-Order traversal is to traverse to the leaf node on the left-most branch.
This traversal should be taken as the worst case performance and therefore cannot be O( c) anymore. Instead, it’s O log 2 (n).
Remember, we’re talking about the worst case performance here, therefore we mustn’t be soft on these algorithms. (So that we can safely measure application performance with some room to fall back on)
Conclusion
We therefore have come to the conclusion of this article. In the next article, which will be one of my last on programming before migrating over to Strides Octopus tutorials for games development.
This article has demonstrated the importance of iterations over loops as well as highlights on Big-O-Notation and its value in helping developers make decisions consciously.
Let’s end this with a few myths and truths!
Myth #1 : Do not use data structures explicitly, just use Collection interface will do.
Truth:
Don’t be stupid, your collection interface object is still based on one of the data structure implementations. If you’re given a Collection object, then you should just use it without type-casting. If you need a specific data structure or data structure type to be passed over, then you’ll need to let your co-developer know. In short, don’t stereotype, don’t be lazy, communicate more, assume less!
Myth #2 : I don’t need to care about what data structure to use, I don’t know and I don’t need to know.
Obviously if I have such a developer, I’ll be asking myself just one simple question : Why did I hire this person?!
A properly trained or well-taught developer will at least need to know the basics of data structure. At the very least, know their performance strengths and weaknesses and know where and when to use and apply.
A good developer is one who knows how to apply his knowledge. Having knowledge alone is useless. Flaunting is worse. Throwing weight around because of technical knowledge is the ugliest of all in the development world.
Myth #3 : Performance? Nah don’t optimize too early. Let the testers make noise before I take any action.
This is a potential recipe for a software disaster. Yes, do not optimize too early. But optimization is very general and vague in a sense. What’s conscious development and optimization? They’re both very different.
- Knowing how to apply doesn’t mean it’s optimization.
- Being conscious of what to use, when to use and what to avoid isn’t optimization! (ie, pass by reference isn’t really optimization, it’s EXPECTED unless it’s meant to be a copy)
Besides, waiting for the tester to make noise is a sign of escaping from reality. Yes, we all need a break, we all need to unwind, but there is no need to wait for a tester to report issues before you get onto it. Sometimes it’s better to prevent certain mistakes than to delay it till it’s too late or complicated for a proper fix. By then, regression hells and cross-linked broken fixes will destroy you completely.
That’s about it. Thanks for viewing!
Regards,
Jeremy
- Permalink
- Admin
- 15 Sep 2009 10:47 AM
- Comments (0)