INT303 Big Data Analysis - Coding Project 2: Loan Approval Prediction

Weightage: 100 points (30% of total course grade)

Due Date: 12 DEC

Submission: Submit your Jupyter Notebook (.ipynb) and a concise 1-2 page executive summary report via [Learning Mall/Submission Portal].

1. Introduction

In the world of finance, accurate and efficient loan approval decisions are paramount. Banks and financial institutions rely on robust data analysis and predictive models to assess applicant creditworthiness, mitigate risks, and optimize their lending portfolios. This project challenges you to step into the role of a Data Scientist at a burgeoning financial technology (FinTech) firm. Your task is to develop a machine learning model that predicts whether a loan application will be approved or rejected based on a comprehensive set of applicant data.

This project aims to solidify your understanding of the entire machine learning pipeline, from exploratory data analysis and preprocessing to model building, evaluation, and interpretation. You will be provided with a dataset containing various applicant attributes and their corresponding loan approval status.

2. Project Objectives

Upon completion of this project, you should be able to:

· Perform comprehensive Exploratory Data Analysis (EDA) to understand data distributions, identify potential issues, and derive insights.

· Implement effective data preprocessing techniques, including handling missing values, encoding categorical features, and scaling numerical features.

· Engineer new, meaningful features from existing ones to enhance model performance.

· Select and implement appropriate machine learning models for classification tasks.

· Evaluate model performance using various metrics and techniques.

· Interpret model results and explain the factors influencing loan approval decisions.

· Present your findings clearly and professionally in a technical report.

· Demonstrate proficiency in Python programming for data analysis and machine learning.

3. Dataset

You will be working with a dataset named loan_approval_dataset_copy.csv (a sample of the "architsharma01/loan-approval-prediction-dataset"). This dataset contains the following columns:

· loan_id: Unique identifier for each loan application.

· no_of_dependents: Number of dependents the applicant has.

· education: Applicant's education level (Graduate/Not Graduate).

· self_employed: Whether the applicant is self-employed (Yes/No).

· income_annum: Applicant's annual income.

· loan_amount: The requested loan amount.

· loan_term: The duration of the loan in years.

· cibil_score: Applicant's CIBIL credit score (a creditworthiness indicator).

· residential_assets_value: Value of residential assets.

· commercial_assets_value: Value of commercial assets.

· luxury_assets_value: Value of luxury assets.

· bank_asset_value: Value of bank assets.

· loan_status: The target variable, indicating whether the loan was 'Approved' or 'Rejected'.

Note: The provided CSV is a small sample. Assume you are working with a larger, more realistic version of this dataset where you may encounter missing values, outliers, and varying data distributions. Your solution should be scalable and robust enough to handle such real-world scenarios.
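
A minimal sketch of a defensive load-and-inspect step is shown here. It reads from an in-memory stand-in rather than the real file, and the rows are invented for illustration. It also assumes the CSV may carry stray spaces after the delimiters, which is an assumption about the export and not something this brief states. The helper name load_loans is hypothetical.

```python
import io

import pandas as pd

# Stand-in for loan_approval_dataset_copy.csv. The rows are invented and only
# a few of the documented columns are included to keep the example short.
raw_csv = io.StringIO(
    "loan_id, education, income_annum, loan_status\n"
    "1, Graduate,9600000, Approved\n"
    "2, Not Graduate,, Rejected\n"
    "3, Graduate,4100000, Approved\n"
)


def load_loans(source):
    """Read the CSV and drop whitespace that follows each delimiter."""
    frame = pd.read_csv(source, skipinitialspace=True)
    # Also trim any remaining whitespace around the column names.
    frame.columns = frame.columns.str.strip()
    return frame


df = load_loans(raw_csv)

print(df.head())        # first few rows
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics for numeric columns
```

With the real file, the same helper would be called with the path to the CSV in place of the in-memory buffer.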

4. Project Tasks

Your submission should include a well-commented Jupyter Notebook and a separate executive summary report (PDF) summarizing your approach, findings, and recommendations.

Task 1: Exploratory Data Analysis (EDA) and Data Preprocessing (30 points)

1. Load and Initial Inspection: Load the dataset into a Pandas DataFrame. Display the first few rows, check data types, and identify missing values. Summarize key statistics.

2. Univariate Analysis: Analyze the distribution of each feature. For numerical features, create histograms and box plots. For categorical features, create bar plots. Describe your observations.

3. Bivariate Analysis: Explore the relationships between features, particularly their relationship with the loan_status target variable. Use appropriate visualizations (e.g., scatter plots, stacked bar plots, heatmaps).

4. Data Cleaning: Handle any identified missing values, outliers (if present), or inconsistencies. Justify your chosen methods.

5. Feature Engineering: Create at least two new, meaningful features that you believe could improve model performance. Explain your rationale.

6. Categorical Encoding: Convert all categorical features into numerical representations suitable for machine learning models (e.g., One-Hot Encoding, Label Encoding).

7. Feature Scaling: Apply appropriate scaling techniques (e.g., StandardScaler, MinMaxScaler) to numerical features.
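
The cleaning, feature engineering, encoding, and scaling steps above could be combined roughly as follows. This is a sketch under stated assumptions, not a prescribed solution: the four-row DataFrame is synthetic, the two engineered features (total_assets and loan_to_income) are only example choices, and median / most-frequent imputation is one option among several that you would still need to justify.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

ASSET_COLS = [
    "residential_assets_value",
    "commercial_assets_value",
    "luxury_assets_value",
    "bank_asset_value",
]

# Synthetic rows using the documented column names; all values are made up.
df = pd.DataFrame({
    "no_of_dependents": [0, 2, 1, 3],
    "education": ["Graduate", "Not Graduate", "Graduate", np.nan],
    "self_employed": ["No", "Yes", "No", "Yes"],
    "income_annum": [5_000_000, 3_000_000, np.nan, 8_000_000],
    "loan_amount": [10_000_000, 9_000_000, 4_000_000, 20_000_000],
    "loan_term": [10, 8, 12, 20],
    "cibil_score": [780, 420, 650, np.nan],
    "residential_assets_value": [2_000_000, 1_000_000, 500_000, 6_000_000],
    "commercial_assets_value": [1_000_000, 0, 200_000, 3_000_000],
    "luxury_assets_value": [4_000_000, 2_000_000, 1_000_000, 9_000_000],
    "bank_asset_value": [1_500_000, 500_000, 300_000, 2_500_000],
})


def add_features(frame):
    """Add two illustrative engineered features (hypothetical choices)."""
    out = frame.copy()
    out["total_assets"] = out[ASSET_COLS].sum(axis=1)
    ratio = out["loan_amount"] / out["income_annum"]
    # A zero income would give inf; treat it as missing so it gets imputed.
    out["loan_to_income"] = ratio.replace([np.inf, -np.inf], np.nan)
    return out


numeric_cols = [
    "no_of_dependents", "income_annum", "loan_amount", "loan_term",
    "cibil_score", *ASSET_COLS, "total_assets", "loan_to_income",
]
categorical_cols = ["education", "self_employed"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

features = add_features(df)
X = preprocess.fit_transform(features)
print(X.shape)
```

In a full workflow the transformer would be fitted on the training split only and then applied to the test split, so that imputation and scaling statistics do not leak information from the test data.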

 

Task 2: Model Development and Evaluation (40 points)

1. Data Splitting: Split your processed data into training and testing sets (e.g., 70% training, 30% testing).

2. Model Selection: Choose at least three different classification algorithms. Good candidates might include:

· Logistic Regression

· Decision Tree Classifier

· Random Forest Classifier

· Gradient Boosting Classifier (e.g., XGBoost, LightGBM)

· Support Vector Machine (SVM)

· K-Nearest Neighbors (KNN)

3. Model Training: Train your chosen models on the training data.

4. Hyperparameter Tuning: Implement a strategy to tune hyperparameters for each selected model (e.g., GridSearchCV, RandomizedSearchCV). Explain why hyperparameter tuning is important.

5. Model Evaluation: Evaluate the performance of each tuned model on the test set using various metrics. At a minimum, include:

· Accuracy

· Precision, Recall, F1-score (for both 'Approved' and 'Rejected' classes)

· ROC AUC Score

· Confusion Matrix

Provide a comparative analysis of the models based on these metrics, considering the business context (e.g., what kind of errors are more costly for a bank?).

6. Feature Importance (if applicable): For tree-based models, analyze and visualize feature importance. Discuss which features your model deems most crucial for loan approval prediction.
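
The splitting, tuning, and evaluation steps above might be organized along these lines. This is a sketch only: make_classification stands in for the processed loan features, the three models and their small grids are example choices, and the label encoding (1 for the positive class) is an assumption rather than something the brief specifies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # fixed seed so repeated runs give the same numbers

# Synthetic stand-in for the preprocessed feature matrix and binary target.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=SEED
)

# 70/30 split, stratified so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=SEED
)

# Three candidate models, each with a deliberately small illustrative grid.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeClassifier(random_state=SEED),
             {"max_depth": [3, 5, None]}),
    "forest": (RandomForestClassifier(random_state=SEED),
               {"n_estimators": [50, 100], "max_depth": [5, None]}),
}

results = {}
best_models = {}
for name, (model, grid) in candidates.items():
    # Cross-validated grid search on the training data only.
    search = GridSearchCV(model, grid, cv=3, scoring="f1")
    search.fit(X_train, y_train)
    best = search.best_estimator_
    best_models[name] = best

    # Final evaluation on the held-out test set.
    y_pred = best.predict(X_test)
    y_prob = best.predict_proba(X_test)[:, 1]
    results[name] = {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
    }

for name, metrics in results.items():
    print(name, metrics["best_params"])
    print(metrics["report"])
    print(metrics["confusion_matrix"])

# Impurity-based importances from the tuned random forest.
importances = best_models["forest"].feature_importances_
print(importances)
```

The confusion matrix is the natural starting point for the business discussion, since it separates the two error types (loans predicted as approved that should have been rejected, and the reverse). Which of those is costlier for a lender is a point for your own analysis to argue.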

Task 3: Executive Summary Report (20 points)

Write a 1-2 page executive summary report (in PDF format) that addresses the following:

1. Introduction: Briefly state the problem and the objective of your project.

2. Methodology: Summarize your data preprocessing steps, feature engineering choices, and the models you experimented with.

3. Key Findings: Present the performance of your best models using relevant metrics. Discuss the most important features.

4. Recommendations & Insights: Based on your analysis, what insights can you provide to the FinTech firm regarding loan approval? Which model would you recommend and why? Suggest potential improvements or next steps for future work.

5. Ethical Considerations (Bonus, 5 points): Briefly discuss any ethical considerations related to building and deploying such a loan approval model (e.g., bias, fairness, transparency).

Task 4: Code Quality and Documentation (10 points)

1. Code Readability: Your Jupyter Notebook should be well-structured, logical, and easy to follow.

2. Comments: Include appropriate comments to explain complex logic, choices, and reasoning.

3. Reproducibility: Ensure your notebook can be run from top to bottom without errors and produces consistent results.
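
One common way to get consistent results across runs is to fix a single seed and pass it to every randomized step. The sketch below uses a toy array in place of the loan data, and the seed value is an arbitrary choice.

```python
import random

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary value; what matters is reusing it everywhere

# Seed the global generators used by code that does not accept random_state.
random.seed(SEED)
np.random.seed(SEED)

# Toy data standing in for the processed features and target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Two splits with the same random_state produce identical partitions.
split_a = train_test_split(X, y, test_size=0.3, random_state=SEED, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=SEED, stratify=y)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Restarting the kernel and running all cells before submission is a simple check that the notebook does not depend on leftover state from earlier experiments.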

 
