Here we will discuss all the DataCamp queries and try to solve them. Please post your answer even if it's wrong; we won't dislike your answer like on Stack Overflow.
So let me start with the first one. This is one of the queries I have received recently:
I wrote functional scripts but couldn't score the point in DataCamp.
I just can't understand where the error is coming from; as far as I can tell, the answer is correct and the output is what is expected.
The two cases are listed below, and screenshots are attached along with the message.
course: Introduction to Data Science with Python
Chap 1 - getting started in python => last exercise (snooping for suspects)
Chap 3 - plotting data with matplotlib => last exercise (identifying Bayes' kidnapper)
The first Bootcamper who solves this problem will get a Christmas chocolate from me!
For the first issue (Chap 1 - getting started in python => last exercise (snooping for suspects)), I don't remember having this problem when I did the exercise for the first time, so I wonder if something has changed…
Anyway, it appears that the script editor is being reset from instruction 1 to 2, and from 2 to 3. So, to get it right when submitting the answer, I had to redo the previous tasks. For the 3rd instruction/final step:
defined plate, redoing instruction 1:
plate = 'FRQ***'
called the lookup_plate() function on plate (with no keyword arguments), redoing instruction 2:
called the lookup_plate() function on plate, with the keyword argument color='Green', following instruction 3:
And the submission was taken as correct.
So, when the feedback for an incorrect submission points out "did you call lookup_plate() twice?", it's actually telling you that you SHOULD call it twice.
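For anyone who wants to see the call pattern the checker expects, here is a minimal sketch. lookup_plate() is DataCamp's own helper, so the stand-in below (and the suspect names and plates in it) is entirely made up; only the three steps mirror the exercise.

```python
# Stand-in for DataCamp's lookup_plate() helper, just to show the call pattern.
# The suspect data here is invented for illustration.
def lookup_plate(plate, color=None):
    """Return names of suspects whose plate matches the partial plate."""
    suspects = [
        {"name": "Fred Frequentist", "plate": "FRQ123", "color": "Green"},
        {"name": "Ronald Aylmer Fisher", "plate": "FRQ456", "color": "Red"},
    ]
    prefix = plate.rstrip("*")  # 'FRQ***' -> match on 'FRQ'
    matches = [s for s in suspects if s["plate"].startswith(prefix)]
    if color is not None:
        matches = [s for s in matches if s["color"] == color]
    return [s["name"] for s in matches]

# Instruction 1: define plate
plate = "FRQ***"
# Instruction 2: call lookup_plate() with no keyword arguments
print(lookup_plate(plate))
# Instruction 3: call lookup_plate() again, with the keyword argument color
print(lookup_plate(plate, color="Green"))
```

The point is simply that all three steps have to be present in the editor at submission time, since the checker re-runs the whole script.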
For the 2nd issue, again: no memory of having this issue before, and now I got the reported error. Weird…
And again, the solution seems to be the same: redo the previous steps…
I got the submission accepted when I did that.
So I guess DataCamp wants to see the two lines in the plot, which makes sense. You probably don't need the first plt.show() call; a single call at the end of the script should be enough (I haven't tested it, though).
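Roughly, that would look like the sketch below. The data is invented (the real exercise has its own coordinates), and the Agg backend line is only there so the script runs without a display; the point is that both plt.plot() calls come before a single plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical coordinates standing in for the exercise's two lines.
hours = [0, 1, 2, 3]
route_a = [0.0, 0.5, 1.0, 1.5]
route_b = [0.0, 1.0, 2.0, 3.0]

# Plot both lines first, so they land in the same figure...
plt.plot(hours, route_a, label="line A")
plt.plot(hours, route_b, label="line B")
plt.legend()
# ...then call plt.show() once at the end.
plt.show()
```

With a plt.show() between the two plt.plot() calls you'd get two separate figures instead, which is presumably what the checker was rejecting.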
Looks like Rodolpho is clearly the winner of this Christmas chocolate!
So, what is your next query, Bootcampers? Please get involved, share, and win chocolates!
Hi Varun and others,
How should I understand the conclusion provided by DataCamp after successful completion of the exercise?
In the plots generated during the exercise, the Isolation Forest algorithm seems to raise more false alarms with 5% of outliers (50 outliers):
than with 50% of outliers (500 outliers):
So it seems that the Isolation Forest algorithm generates more false alarms when the level of contamination is low (and not high, as written in DataCamp's conclusion). Am I right?
First of all, wish you a Happy New Year!
Looking into your query, it looks like as we keep pumping up the number of outliers, the Isolation Forest is getting clumsier at detecting True Positives, i.e. noise or anomalies, and it starts to detect more False Positives, aka false alarms: "what was in reality true data is now falsely detected as an anomaly". This is what DataCamp is trying to say in their conclusion. I hope it helps; let me know if I have missed something…
Happy New Year to you too!
I tend to agree with the first part of your reply:
Looking into your query it looks like as we keep pumping up the number of outliers, the Isolation Forest is getting clumsier at detecting True Positives, i.e. noise or anomalies.
However, I seem to be confused about what a positive case is. I am assuming here that a positive case = a dot classified as noise by the Isolation Forest algorithm. There are of course True and False Positives.
In the top plot, where actual noise = 50 dark dots, the Isolation Forest algorithm detects more than 50 "noisy" dots. Yellow dots on the left-hand side which are dark on the right-hand side are False Positives. Here I am assuming False Alarm == False Positive. I visually cannot spot any False Negative, so the recall should be close to 100% (but precision is much lower).
In the bottom plot, where actual noise = 500 dark dots, the Isolation Forest algorithm detects fewer than 500 "noisy" dots. Noisy (dark) dots on the LHS which are colored yellow on the RHS are False Negatives. There are indeed more False Negatives when more noise is added to the clean data, but to my understanding a False Negative != a False Alarm. I visually cannot spot any False Positive, so the precision should be close to 100% (but recall is much lower).
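One way to take the eyeballing out of this is to recompute precision and recall directly. The sketch below is a hedged reconstruction of the setup, not the exercise's actual code: the sample sizes, distributions, and random seed are all assumptions, and "flagged as noise" is treated as the positive class, as in my reading above.

```python
# Hedged reconstruction: contaminate clean 2-D data with outliers, fit an
# IsolationForest, and score it treating "flagged as noise" as positive.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.RandomState(42)
n_outliers = 50  # try 500 to mimic the second plot

clean = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
noise = rng.uniform(low=-6.0, high=6.0, size=(n_outliers, 2))
X = np.vstack([clean, noise])
y_true = np.r_[np.zeros(len(clean)), np.ones(len(noise))]  # 1 == actual noise

contamination = n_outliers / len(X)
iso = IsolationForest(contamination=contamination, random_state=42)
# IsolationForest predicts -1 for outliers; map that to 1 == "flagged as noise"
y_pred = (iso.fit_predict(X) == -1).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```

Running it with n_outliers = 50 and then 500 should show directly which of precision and recall degrades as contamination grows.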
Does this make sense to you?
Hi jfb, I am a bit confused about how you are calculating precision. When I checked the number of anomalies detected in the console, increasing the outlier_number parameter from 50 to 500 makes the true positives go from 50 out of 50 to 300 out of 500, which is almost a 40% decrease. So in our case, when the contamination (noise) increases, the fraction of true positives decreases. But I totally agree with you that we are actually underestimating the anomalies, not overestimating them, as the contamination increases. I also tried a few other parameters: with number_outliers = 1000 I get true positives = 400, which is even worse.
@ all other Bootcampers, please jump in if you have any suggestions or a better explanation for jfb's query. Don't forget, chocolates are waiting for you!
I feel there is a typo in the video https://campus.datacamp.com/courses/pandas-foundations/exploratory-data-analysis?ex=1:
To me, it makes no sense to plot sepal_length along the y axis and label it 'sepal width'. The second line should be:
plt.ylabel('sepal length (cm)')
Do you agree?
Good catch, jean!
Everything you said makes total sense to me.
The exercise itself does not (IMHO). In order to make DataCamp's statement correct ("more contaminated data makes it raise more false alarms"), you have to consider yellow dots == positive (anomaly, alarm) and dark dots == negative, which is the other way around from what you have been considering. That would be counter-intuitive… and perhaps would not make much sense.
Added noise in this case would mean adding non-anomaly events, thus adding negative events.
I guess the confusion is a result of the problem not being presented well enough (what are the measured variables, what is the regular/"normal" case, etc.).
Thanks for your reaction, Rodolpho! Happy to read I am not the only one confused!
At the time, I sent DataCamp my comments about that exercise using the feedback button on the top right of the screen. Let's hope they'll do something with it!