Unique Paper Code : 42343307
Name of the Course : B. Sc. Programme / B. Sc. Mathematical Science SEC-1
Name of the Paper : Data Analysis using Python Programming
Semester : III
Year of Admission : 2019 onwards

Duration: 3 Hours
Maximum Marks: 75
Attempt any four questions. All questions carry equal marks.

1. Consider a list of values:

bag = [25,26,21,22,31,29,33,34,26,30,31,46]

• Import the appropriate Python libraries to create an ndarray called bag_weights having 3 rows and 4 columns from the list bag.

• Use the NumPy library to display the mean, variance and median of the given data in bag_weights.

• Write a command to display the count of values greater than the median in bag_weights.

• Transpose bag_weights and then split it into two arrays, bagA and bagB, each having 2 rows and 3 columns.

• Sort bagA so that the highest value of each row appears in the first column. Sort bagB so that the lowest value of each row appears in the first column.

• Find the union and intersection of values in bagA and bagB.
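
One possible approach to Question 1 is sketched below. It is not an official answer key. It assumes that "variance" means the population variance (NumPy's default, ddof=0), that the split is made row-wise on the transposed array, and that the sorting is done within each row. The names count_above, bagA_sorted and bagB_sorted are introduced here for illustration.

```python
import numpy as np

bag = [25, 26, 21, 22, 31, 29, 33, 34, 26, 30, 31, 46]

# Build a 3 x 4 ndarray from the flat list
bag_weights = np.array(bag).reshape(3, 4)

# Summary statistics over all twelve values
mean = np.mean(bag_weights)
variance = np.var(bag_weights)    # population variance (ddof=0), an assumption
median = np.median(bag_weights)
print(mean, variance, median)

# Count of values strictly greater than the median
count_above = int(np.sum(bag_weights > median))
print(count_above)

# Transpose to 4 x 3, then split along the rows into two 2 x 3 arrays
bagA, bagB = np.split(bag_weights.T, 2, axis=0)

# Highest value first in each row of bagA (descending),
# lowest value first in each row of bagB (ascending)
bagA_sorted = -np.sort(-bagA, axis=1)
bagB_sorted = np.sort(bagB, axis=1)

# Set operations on the values of the two arrays (results are flattened and sorted)
union = np.union1d(bagA, bagB)
intersection = np.intersect1d(bagA, bagB)
print(union, intersection)
```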

2. Consider a list of values:

rate = [4.23,3.8,2.98,2.56,3,114,3.8,3.78,2.98,4.8,4.10,3.65]

• Import the appropriate Python libraries to create a one-dimensional ndarray called growth_rate from the list rate. Create another one-dimensional array named twos having the same number of elements as growth_rate, all set to 2.

• Use the NumPy library to find the index of the maximum and the minimum values in the array growth_rate.

• What does a box plot show? Give a command to display a box plot for growth_rate.

• Concatenate the two arrays growth_rate and twos, and reshape the resulting array to have four rows and an appropriate number of columns; call it results.

• Find the mean, median, mode and standard deviation of each column in results.

• Write a command to store the array results to a file called result.npy on the disk in the current working directory.
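
A possible sketch for the array steps of Question 2 follows. It takes the list rate exactly as printed (12 values), which is what allows the concatenated array of 24 values to be reshaped into four rows. NumPy has no built-in mode function, so the helper column_mode is a hypothetical name introduced here; when several values tie, it returns the smallest. The box plot step is left out to keep the sketch dependent on NumPy only; with Matplotlib it is typically drawn with plt.boxplot(growth_rate).

```python
import numpy as np

rate = [4.23, 3.8, 2.98, 2.56, 3, 114, 3.8, 3.78, 2.98, 4.8, 4.10, 3.65]

# One-dimensional arrays: the data and an equally long array of 2s
growth_rate = np.array(rate)
twos = np.full(growth_rate.shape, 2.0)

# Positions of the largest and smallest values
idx_max = int(np.argmax(growth_rate))
idx_min = int(np.argmin(growth_rate))
print(idx_max, idx_min)

# Join the arrays and reshape to 4 rows; -1 lets NumPy infer the column count
results = np.concatenate([growth_rate, twos]).reshape(4, -1)

# Column-wise statistics (axis=0 runs down each column)
col_mean = results.mean(axis=0)
col_median = np.median(results, axis=0)
col_std = results.std(axis=0)      # population standard deviation (ddof=0)


def column_mode(col):
    """Return the most frequent value in a 1-D array (smallest value on ties)."""
    values, counts = np.unique(col, return_counts=True)
    return values[np.argmax(counts)]


col_mode = np.array([column_mode(results[:, j]) for j in range(results.shape[1])])
print(col_mean, col_median, col_mode, col_std)

# Store the array in the current working directory
np.save("result.npy", results)
```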

3. Consider the following DataFrame (df):

Write suitable Python command(s) using the Pandas library:

• Display the number of rows and columns present in the DataFrame df.

• Display the names of columns that have NULL values present in them, along with the count of NULL values. Replace the NULL values present in each such column with the lowest value in that column.

• Create a new column in df named Rating, which contains the mean of User_rating and Critic_rating. Create another column, Profit, which contains the difference of Gross_collections and Budget.

• Find the correlation between Budget and Rating. Based on the correlation values between two variables, what inference(s) can be drawn about the relationship between them?

• Group the movies according to the Director_name. Find the most profitable director.

• What does a contingency table depict? Write commands to display the contingency table between Director_name and Language.
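
A possible sketch for Question 3 is given below. The DataFrame from the question is not reproduced in this text, so the rows here are hypothetical stand-in values; only the column names come from the question. The sketch also assumes that "most profitable director" means the director with the largest total Profit. Using the mean per film instead would be an equally defensible reading.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; only the column names are taken from the question
df = pd.DataFrame({
    "Director_name": ["D1", "D2", "D1", "D3", "D2"],
    "Language": ["Hindi", "English", "Hindi", "Tamil", "English"],
    "Budget": [100.0, 80.0, np.nan, 60.0, 120.0],
    "Gross_collections": [250.0, 90.0, 150.0, 40.0, 300.0],
    "User_rating": [8.0, 6.5, 7.0, np.nan, 9.0],
    "Critic_rating": [7.0, 6.0, 7.5, 5.0, 8.5],
})

# Number of rows and columns
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Columns containing NULLs, with their NULL counts
null_counts = df.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)

# Replace NULLs in each affected column with that column's minimum
for col in null_counts.index:
    df[col] = df[col].fillna(df[col].min())

# Derived columns
df["Rating"] = df[["User_rating", "Critic_rating"]].mean(axis=1)
df["Profit"] = df["Gross_collections"] - df["Budget"]

# Pearson correlation between Budget and Rating (a value in [-1, 1])
corr = df["Budget"].corr(df["Rating"])
print(corr)

# Total profit per director; idxmax gives the director with the largest total
profit_by_director = df.groupby("Director_name")["Profit"].sum()
top_director = profit_by_director.idxmax()
print(top_director)

# Contingency table: counts of films for each (director, language) pair
contingency = pd.crosstab(df["Director_name"], df["Language"])
print(contingency)
```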

4. Consider a dictionary:

dict1 = {'Chhetri': 80, 'Shabbir': 23, 'Gouramangi': 6,
         'Subrata': 92, 'Vijayan': 29, 'Gawli': None, 'Nabi': 7,
         'Renedy': 4, 'Lalpekhlua': 23, 'Baichung': 41, 'Surkumar': 2}

Write suitable Python command(s) using the Pandas library:

• Create a Pandas Series for the dictionary dict1 where the key is name of the footballer and the value is the number of goals scored by him. The Series should have the names of the footballers as its index and values as goals scored.

• Display the names of footballers who have scored more than 20 goals.

• Due to the good performance of the top six footballers, their rankings have increased and the number of goals scored by them needs to be increased by 25. Round the resulting value to the nearest integer equal to or more than the computed number of goals. Update the Series to reflect these changes.

• Include a 12th man named 'Mondal' in the above Series whose number of goals scored is not known.

• Display the list of footballers whose number of goals scored is NOT NULL.

• Due to injury, 'Shabbir' was replaced by 'Sandhu' whose number of goals scored is 5. Reflect this change in the Series and display the new Series.
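
A possible sketch for Question 4 follows. It rests on several assumptions: the NULL entry is represented as np.nan, the "top six footballers" are the six with the most goals, and "increased by 25" is read literally as adding 25. Under that reading the ceiling step changes nothing; it would matter only if the increase produced fractional values. The names goals, more_than_20, top6 and known are introduced here for illustration.

```python
import numpy as np
import pandas as pd

dict1 = {"Chhetri": 80, "Shabbir": 23, "Gouramangi": 6,
         "Subrata": 92, "Vijayan": 29, "Gawli": np.nan, "Nabi": 7,
         "Renedy": 4, "Lalpekhlua": 23, "Baichung": 41, "Surkumar": 2}

# Keys become the index (names), values become the data (goals)
goals = pd.Series(dict1, name="goals")

# Names of footballers with more than 20 goals
more_than_20 = goals[goals > 20].index.tolist()
print(more_than_20)

# Top six by goals (an assumption); add 25 and round up
top6 = goals.nlargest(6).index
goals.loc[top6] = np.ceil(goals.loc[top6] + 25)

# Add a 12th player whose goal count is unknown
goals.loc["Mondal"] = np.nan

# Footballers whose goal count is not NULL
known = goals[goals.notnull()].index.tolist()
print(known)

# Replace 'Shabbir' with 'Sandhu' (5 goals), keeping the same position
goals = goals.rename(index={"Shabbir": "Sandhu"})
goals.loc["Sandhu"] = 5
print(goals)
```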

5. The first few rows of the standard iris dataset in the sklearn library are given below:

• Import the appropriate Python libraries to load the dataset. Create a Pandas DataFrame named iris having all the columns in the dataset.

• Use an appropriate command to display a summary of the vital statistics of all numerical and categorical attributes in iris.

• What is the role of pre-processing in data analysis? Discuss how you will choose between (a) deleting the rows containing missing values, (b) replacing the missing values in a column with the mean, or (c) replacing them with the mode of the column.

• Give a Pandas command to convert the categorical attribute species into dummy variables. Display all the columns of the DataFrame including the dummy variables. Give a command to drop the column species from the DataFrame.

• Draw a scatterplot between the columns sepal length and petal length for the species setosa in iris.

• Create 5 equal-length bins for each of the two columns sepal length and sepal width. Draw two histograms, one each for the values of sepal length and sepal width in these bins, in a single figure. Save this image in a file on the hard disk.
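
The Pandas steps of Question 5 can be sketched as below. To stay self-contained, the sketch uses six hand-typed rows shaped like the iris data rather than loading the full dataset, and the underscore column names are an assumption. With scikit-learn installed, the full data is usually obtained with sklearn.datasets.load_iris(as_frame=True), whose column names differ from those used here.

```python
import pandas as pd

# A few hand-typed rows for illustration only; column names are assumed
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Summary statistics for numerical and categorical columns together
summary = iris.describe(include="all")
print(summary)

# Dummy (one-hot) variables for species, appended to the original columns
dummies = pd.get_dummies(iris["species"], prefix="species")
iris_encoded = pd.concat([iris, dummies], axis=1)
print(iris_encoded.columns.tolist())

# Drop the original categorical column
iris_encoded = iris_encoded.drop(columns="species")

# Rows for one species, as needed for the scatter plot
setosa = iris[iris["species"] == "setosa"]

# Five equal-width bins for each of the two sepal columns
sepal_length_bins = pd.cut(iris["sepal_length"], bins=5)
sepal_width_bins = pd.cut(iris["sepal_width"], bins=5)
length_counts = sepal_length_bins.value_counts(sort=False)
width_counts = sepal_width_bins.value_counts(sort=False)
print(length_counts)
print(width_counts)
```

The plotting steps are not exercised here. With Matplotlib they are commonly written with plt.scatter for the setosa rows, plt.hist(..., bins=5) on two subplots, and plt.savefig to write the figure to disk.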



6. Consider the details of 15 rubies as follows:



• Import the appropriate Python libraries to create a Pandas DataFrame named rubies having the above columns. The columns and rows of the DataFrame should have appropriate names.

• Draw box plots for all numerical columns of the dataset in the same chart. Display the median of all numerical attributes in rubies for each type of cut.

• Display the per carat average price of all rubies grouped by the two attributes clarity and color.

• Normalize all quantitative features to the range [0, 1].

• Draw a word cloud for the attribute cut.
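
A sketch of the tabular steps of Question 6 follows. The table of 15 rubies is not reproduced in this text, so the four rows, the row labels and the column names carat, cut, color, clarity and price are hypothetical stand-ins. "Per carat average price" is read here as the mean of price divided by carat within each group, and normalization is done by min-max scaling; both are assumptions.

```python
import pandas as pd

# Hypothetical stand-in rows; the real table has 15 rubies
rubies = pd.DataFrame(
    {
        "carat": [0.5, 1.0, 1.5, 2.0],
        "cut": ["Ideal", "Premium", "Ideal", "Premium"],
        "color": ["E", "E", "F", "F"],
        "clarity": ["VS1", "VS1", "SI1", "SI1"],
        "price": [1000.0, 2500.0, 4500.0, 5000.0],
    },
    index=["ruby1", "ruby2", "ruby3", "ruby4"],
)

# Names of the numerical columns
numeric_cols = rubies.select_dtypes(include="number").columns

# Median of every numerical attribute for each type of cut
median_by_cut = rubies.groupby("cut")[numeric_cols].median()
print(median_by_cut)

# Price per carat for each ruby, averaged within (clarity, color) groups
per_carat = rubies["price"] / rubies["carat"]
avg_per_carat = per_carat.groupby([rubies["clarity"], rubies["color"]]).mean()
print(avg_per_carat)

# Min-max scaling of the numerical columns to [0, 1]
col_min = rubies[numeric_cols].min()
col_max = rubies[numeric_cols].max()
normalized = rubies.copy()
normalized[numeric_cols] = (rubies[numeric_cols] - col_min) / (col_max - col_min)
print(normalized)
```

The box plots and the word cloud are not exercised here. Box plots for the numerical columns are commonly drawn with the DataFrame's boxplot method, and a word cloud needs a third-party package such as wordcloud.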