data编程辅导 、讲解 java语言程序
            
                Task Description: Implementing a HyperLogLog Sketch and bSktHLL Data Structure
Objective
The goal is to develop two classes to estimate the cardinality (flow spread) of network flow
data using the HyperLogLog algorithm and a specialized data structure comprising an array
of HyperLogLog sketches. The input data (o1.txt) consists of source and destination IP pairs.
The output should be a CSV file containing the estimated cardinality for each unique
destination IP, along with the true count of occurrences.
Requirements
1. HLL Class: Implement a class named HLL to represent a HyperLogLog sketch. This
class should include:
• A constructor that initializes the number of registers (m, default 128) and
register bit size (p, default 5).
• A method add for adding elements to the sketch, which involves hashing the
element, updating the appropriate register based on the hash value, and
tracking the maximum number of leading zeros.
• A method estimate to calculate and return the estimated cardinality based
on the register values.
2. bSktHLL Class: Implement a class named bSktHLL that contains an array of l HLL
sketches (l=3 as each sketch produce a candidate estimate). Each sketch should
have m=128 HLL registers with each register being 5 bits. This class should include:
• A constructor that initializes the array of HLL sketches.
• A method record to hash each flow into two different HLL sketches and
update the sketches accordingly.
• A method query to return the median estimate of the cardinality for a given
flow from the sketches it hashes into.
3. Data Processing Script: Write a script to:
• Read network flow data from a text file, where each line contains a source
and destination IP pair.
• Process each line to record the destination IP in the bSktHLL data structure.
• Generate a CSV file with the following columns: flow_label,
candidate_estimate_1, candidate_estimate_2, candidate_estimate_3,
true_spread. The candidate_estimate columns should contain the
cardinality estimates, and the true_spread should contain the actual count
of occurrences for each destination IP.
Notes: The error in the candidate estimation should not exceed 5% of the actual spread.
References: Assistance functions for recording and querying flows can be sourced from
HyperLogLog.java, while references for hash functions should be utilized from
GeneralUtil.java.