STAT 7008 – Assignment 3

Submission deadline: midnight, 19 Nov 2018
All questions in this assignment must be solved or answered by writing Python programs.
Total Marks 100
Question 1: Reading pdf (36 marks)
The file 57070_CampbellSoup_Investor_Spread.pdf is a financial report of the Campbell Soup company.
(a) Write Python code to identify the page on which the Consolidated Statements of Cash Flows is located.
(b) Write a Python program and use appropriate regular expressions to convert the Consolidated Statements of Cash Flows to a Pandas DataFrame.
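As a starting point for parts (a) and (b), the following is a minimal sketch rather than a model answer. It assumes the text of each PDF page has already been extracted into a list of strings by a PDF library of your choice; the page texts, the figures and the column names "2017" and "2016" are hypothetical, and the real statement will need a more careful pattern. Note that a table of contents may also mention the statement title, so the first matching page is not necessarily the right one.

```python
import re
import pandas as pd

# Hypothetical page texts standing in for text extracted from the PDF,
# one string per page. The figures are made up for illustration.
pages = [
    "Letter to Shareholders\nFiscal year in review",
    "Consolidated Statements of Cash Flows\n"
    "(millions)\n"
    "Net earnings 887 563\n"
    "Depreciation and amortization 318 308\n"
    "Purchases of plant assets (338) (341)\n"
    "Other, net - 5\n",
]

def find_page(pages, title):
    """Return the 1-based number of the first page containing title."""
    # Allow any run of whitespace between the words of the title.
    pattern = re.compile(r"\s+".join(map(re.escape, title.split())), re.IGNORECASE)
    for number, text in enumerate(pages, start=1):
        if pattern.search(text):
            return number
    return None

# A number such as 887, 1,234.5 or (338), or a lone dash.
NUMBER = r"\(?\$?\d[\d,]*(?:\.\d+)?\)?|-"
# A row is a label followed by one or more numbers up to the end of the line.
ROW = re.compile(rf"^(?P<label>.+?)\s+(?P<values>(?:(?:{NUMBER})(?:\s+|$))+)$")

def to_number(token):
    """Convert one token to float; parentheses mean a negative amount."""
    if token == "-":
        return 0.0  # assumption: a dash in the statement means zero
    negative = token.startswith("(")
    value = float(token.strip("()").replace("$", "").replace(",", ""))
    return -value if negative else value

def statement_to_frame(text, columns):
    """Collect every line that looks like 'label value value ...'."""
    records = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        values = [to_number(tok) for tok in match.group("values").split()]
        if len(values) == len(columns):
            records.append([match.group("label")] + values)
    return pd.DataFrame(records, columns=["Item"] + list(columns))

page_number = find_page(pages, "Consolidated Statements of Cash Flows")
cash_flows = statement_to_frame(pages[page_number - 1], ["2017", "2016"])
print(page_number)
print(cash_flows)
```

Lines without trailing numbers, such as the title and the units line, are skipped because they do not match the row pattern.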
Question 2: Identifying undervalued stocks (44 marks)
(Use the three sets of Pandas notes to write your code)
The main objectives of this question are to complete the following tasks:
1. The screener on the website http://finviz.com/ provides a comprehensive list of variables for 7,541 listed companies. We are interested in downloading the information provided on the website into a pandas data frame for further analysis. The links from which we are going to download the variables begin with
'https://finviz.com/screener.ashx?v=152&r=a suitable number &c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69'.
Observe how the links can be constructed by supplying the set of suitable numbers to the above link.
Example code to download a table from each link and then combine the tables into one large pandas data frame is given below:
import pandas as pd
import numpy as np
from pandas.io.parsers import TextParser
from numpy import nan as NA
from lxml.html import parse
from urllib.request import urlopen

def _unpack(row, kind='td'):
    # Return the text of every cell of the given kind in a table row
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

def parse_options_data(table):
    rows = table.findall('.//tr')
    data = [_unpack(r) for r in rows]
    header = data[6]   # row 6 holds the column names
    data2 = data[7:]   # the data rows follow the header
    return TextParser(data2, names=header).get_chunk()

# tailurl is the set of suitable numbers (as strings)
# y is the tail of the link (the comma-separated column list)
baseurl = 'https://finviz.com/screener.ashx?v=152&r='
df = pd.DataFrame()
for x in tailurl:
    parsed = parse(urlopen(baseurl + x + '&c=' + y))
    doc = parsed.getroot()
    tables = doc.findall('.//table')
    pdf = parse_options_data(tables[6])
    df = pd.concat([df, pdf], ignore_index=True)
    print(x + ' is completed')
Similar code can also be found in the notes Data Loading and Storage with Pandas and Pandas Data Wrangling, Aggregation and Group Operations.
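The example leaves tailurl and y undefined. One possible way to build them is sketched below. It assumes that the screener shows 20 rows per page, so that r is the 1-based index of the first row on each page, and that the company count is the 7,541 quoted above; both should be checked against the live site.

```python
# Assumptions: 20 rows per screener page and 7,541 companies in total.
ROWS_PER_PAGE = 20
TOTAL_COMPANIES = 7541

# r takes the values 1, 21, 41, ... as strings, one per page.
tailurl = [str(start) for start in range(1, TOTAL_COMPANIES + 1, ROWS_PER_PAGE)]

# The column list 0,1,...,69 that follows '&c=' in the link.
y = ",".join(str(i) for i in range(70))

baseurl = "https://finviz.com/screener.ashx?v=152&r="
urls = [baseurl + x + "&c=" + y for x in tailurl]
print(len(urls))
print(urls[1][:60])
```

Keeping the numbers as strings lets them be concatenated directly, as in baseurl+x+'&c='+y.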
2. You may find that some rows in the df dataframe consist mostly of NaN values. Remove those rows.
3. Remove the column 'Earnings'.
4. The columns from the 6th to the last are all stored as strings, which contain 'B', 'M', 'K', '%', '-' and ','. Write a function to clean the data and convert every column to float or int format, whichever is appropriate.
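A cleaning function for task 4 might look like the sketch below. The sample columns and values are hypothetical, and several choices are assumptions rather than requirements: a lone '-' is treated as missing, percentages are converted to fractions, and text that cannot be parsed is returned unchanged.

```python
import numpy as np
import pandas as pd

# Suffixes that scale the number in front of them.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def clean_value(value):
    """Convert strings such as '1.20B', '2.5%' or '1,234' to float."""
    if not isinstance(value, str):
        return value
    text = value.strip().replace(",", "")
    if text in ("", "-"):
        return np.nan  # assumption: a lone dash marks a missing value
    try:
        if text.endswith("%"):
            return float(text[:-1]) / 100  # assumption: store as a fraction
        if text[-1] in MULTIPLIERS:
            return float(text[:-1]) * MULTIPLIERS[text[-1]]
        return float(text)
    except ValueError:
        return value  # leave text that is not a number unchanged

# Hypothetical sample in the style of the downloaded table.
sample = pd.DataFrame({
    "Market Cap": ["1.20B", "350.5M", "-"],
    "Dividend": ["2.5%", "-", "0.75%"],
    "Volume": ["1,234,567", "980", "12K"],
})
cleaned = sample.apply(lambda col: col.map(clean_value))
print(cleaned)
```

A column that contains NaN cannot be stored as a plain int column, so a conversion to int is only possible for columns without missing values.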
5. Obtain a histogram of stock prices using the code
df. .hist(bins=100,alpha=0.3,color='k',normed=True).
However, the resulting graph consists of a single bar, which is unexpected given that we have over 7,000 stocks. Therefore, consider only stock prices below 150 and reproduce the histogram.
6. Obtain a horizontal bar chart of the average prices per Sector.
7. Obtain a horizontal bar chart of the 30 highest industry averages, where the average price for each industry is taken over its 20 highest-priced stocks.
8. Obtain a horizontal bar chart of the average prices per industry in the financial sector, ignoring the largest industry.
9. Since the property casualty insurers industry has the highest average price in the finance sector, obtain a horizontal bar chart of the 50 highest stock prices among property casualty insurers, ignoring the largest one.
10. Create variables to locate stocks which sell below their sector averages on PE, PEG, PS, PB and Price respectively.
11. Create variables to locate stocks which sell below their industry averages on PE, PEG, PS, PB and Price respectively.
12. Items 10 and 11 together define 10 simplified criteria for an undervalued stock. Create an index that counts the number of criteria each stock satisfies. We call this index the relative_value_index.
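The group comparisons in tasks 10 to 12 can be sketched with groupby and transform, as below. The data and the column names are hypothetical, and only two of the five measures are shown to keep the example small; the other measures would be added to the metrics list in the same way.

```python
import pandas as pd

# Hypothetical, already cleaned data. Column names are assumptions.
df = pd.DataFrame({
    "Ticker": list("ABCDEF"),
    "Sector": ["Tech", "Tech", "Tech", "Fin", "Fin", "Fin"],
    "Industry": ["Software", "Software", "Chips", "Banks", "Banks", "Insurance"],
    "P/E": [10.0, 30.0, 20.0, 8.0, 12.0, 16.0],
    "Price": [20.0, 80.0, 50.0, 15.0, 25.0, 50.0],
})
metrics = ["P/E", "Price"]  # extend with the PEG, PS and PB columns

flags = []
for group in ["Sector", "Industry"]:
    for metric in metrics:
        name = f"{metric} below {group} avg"
        # transform('mean') returns the group average aligned to each row
        df[name] = df[metric] < df.groupby(group)[metric].transform("mean")
        flags.append(name)

# Each flag is a boolean, so the row sum counts the criteria satisfied.
df["relative_value_index"] = df[flags].sum(axis=1)
print(df[["Ticker", "relative_value_index"]])
```

A comparison involving NaN evaluates to False, so a stock with a missing measure simply does not earn a point for that criterion.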
13. Besides the relative_value_index, suppose that other criteria for identifying an undervalued stock are as follows:
a) Price per share is between $20 and $100
b) Volume must be greater than 10,000
c) Positive earnings per share and positive projected earnings per share
d) Total debt to equity ratio less than 0.5
e) Beta less than 1.5
f) Institutional ownership less than 30 percent
g) relative_value_index greater than 8
Identify the stocks in the dataset that satisfy all of the stated criteria.
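Criteria (a) to (g) can be combined into a single boolean mask, as in the sketch below. The three stocks and all column names are hypothetical, and institutional ownership is assumed to be stored as a fraction after cleaning.

```python
import pandas as pd

# Hypothetical cleaned data. Column names are assumptions.
stocks = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC"],
    "Price": [45.0, 150.0, 30.0],
    "Volume": [250000, 900000, 5000],
    "EPS": [2.1, 5.0, 1.0],
    "EPS next Y": [2.4, 5.5, 1.2],
    "Debt/Eq": [0.3, 0.2, 0.1],
    "Beta": [1.1, 0.9, 1.0],
    "Inst Own": [0.25, 0.10, 0.20],
    "relative_value_index": [9, 10, 9],
})

mask = (
    stocks["Price"].between(20, 100)           # (a), both ends inclusive
    & (stocks["Volume"] > 10_000)              # (b)
    & (stocks["EPS"] > 0)                      # (c) current earnings
    & (stocks["EPS next Y"] > 0)               # (c) projected earnings
    & (stocks["Debt/Eq"] < 0.5)                # (d)
    & (stocks["Beta"] < 1.5)                   # (e)
    & (stocks["Inst Own"] < 0.30)              # (f), stored as a fraction
    & (stocks["relative_value_index"] > 8)     # (g)
)
undervalued = stocks[mask]
print(undervalued["Ticker"].tolist())
```

Here BBB fails the price range and CCC fails the volume threshold, so only AAA remains.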
Question 3: Understanding and revising alien code (20 marks)
The website https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/ describes a program which can read the numbers shown on a credit card.
Basically the steps are as follows:
1. Given images of the digits 0 to 9, convert them to grayscale and use the cv2 contour function to find the contour of each digit. These contours act as a set of reference contours.
2. Since the digits appear in groups of 4 inside a rectangular box, and each digit sits in a square box, we specify the dimensions of the rectangular box and of the square box. These dimensions help to identify the positions of the digit groups.
3. Given a credit card image, the strategy is to convert it to grayscale. Since the digits are in a bright color, a set of cv2 functions can be used to process the image so that continuous light-colored areas that conform to the given box dimensions can be identified.
4. For each identified area, use the cv2 contour function again to identify the contours of the digits in the area.
5. For each digit, from left to right, template matching against the reference contours produces a score for each reference digit, and the best-scoring reference gives the matching digit.
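The scoring idea in step 5 can be illustrated without OpenCV. The sketch below is a NumPy stand-in, not the code of ocr_template_match.py: it scores a digit region against each reference by the correlation of mean-centered pixels and picks the highest score. The 5x3 binary templates and the function names are hypothetical, and real regions would first be resized to a common shape.

```python
import numpy as np

def match_score(roi, template):
    """Correlation of mean-centered pixels; higher means more similar."""
    a = roi.astype(float) - roi.mean()
    b = template.astype(float) - template.mean()
    return float((a * b).sum())

def classify_digit(roi, references):
    """Return the reference digit whose template scores highest."""
    scores = {digit: match_score(roi, ref) for digit, ref in references.items()}
    return max(scores, key=scores.get)

# Toy 5x3 binary templates standing in for the reference digit images.
references = {
    0: np.array([[1, 1, 1], [1, 0, 1], [1, 0, 1], [1, 0, 1], [1, 1, 1]]),
    1: np.array([[0, 1, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0], [1, 1, 1]]),
    7: np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 1, 0]]),
}

# A '7' with one pixel missing, to mimic an imperfect card image.
noisy_seven = references[7].copy()
noisy_seven[4, 1] = 0

print(classify_digit(noisy_seven, references))
```

Because the score compares a region with every reference, reading a different card style mainly requires a reference set whose digits look like those on the card.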
The program ocr_template_match.py, the digit image OCR-A_reference.png and the credit card image card_image.png are provided for your study.
Unfortunately, the program cannot read the AMEX card Hilton-honors1.png. A special set of reference digits, AMEX_reference.png, has been created. The task in this exercise is to modify the existing program so that it can read this AMEX card.