Application of big data technology in enterprise information security management

The BDT framework

The BDT framework is a collection of tools and techniques for processing large-scale data^42,43. In this study, a comprehensive BDT framework based on the Hadoop ecosystem is carefully constructed, which consists of four interrelated and progressive core components, aiming to achieve efficient management and intelligent analysis of large-scale security-related data. The specific architecture is displayed in Fig. 1:

Figure 1 presents the deployment of a distributed log collection system in the data collection phase. The system acts as a real-time data pipeline that continuously captures and transmits raw security event data from the enterprise’s internal information systems, network devices, server logs, and other security devices. To meet the storage requirements, the Hadoop Distributed File System (HDFS) is chosen as the underlying large-scale distributed storage infrastructure^44,45. HDFS not only accommodates massive data but also ensures high availability and reliability through redundant replication. For unstructured and semi-structured data, NoSQL databases are additionally used. Their flexible data models and efficient column-family storage meet the data management needs of different scenarios.

In the data processing phase, two complementary technical paths are employed to address diverse business scenarios. On the one hand, the classical MapReduce programming model is used for batch processing jobs; it excels at partitioning large datasets into independent subtasks and executing data cleaning, transformation, and preliminary analysis in parallel on cluster nodes^46. On the other hand, for scenarios that demand real-time, low-latency processing, the Apache Spark framework is introduced to enable real-time response and dynamic analysis of security events.
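The batch path can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce steps. It is not Hadoop code: the log line layout, the `LOGIN_FAILED` marker, and all function names are assumptions made for illustration, and a real job would run the same logic distributed across cluster nodes.

```python
from collections import defaultdict

# Hypothetical log format: "<timestamp> <source_ip> <event> <user>"
LOG_LINES = [
    "2024-01-01T00:00:01 10.0.0.5 LOGIN_FAILED alice",
    "2024-01-01T00:00:02 10.0.0.5 LOGIN_FAILED alice",
    "2024-01-01T00:00:03 10.0.0.7 LOGIN_FAILED bob",
    "2024-01-01T00:00:04 10.0.0.7 LOGIN_SUCCESS bob",
]


def map_phase(line):
    """Map step: emit (source_ip, 1) for every failed-login line."""
    parts = line.split()
    if len(parts) >= 3 and parts[2] == "LOGIN_FAILED":
        yield parts[1], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


failed_by_ip = run_job(LOG_LINES)
```

Because each line is mapped independently and each key is reduced independently, both steps can be split across nodes without coordination, which is the property the batch path relies on.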

Finally, in the data analysis phase, a range of tools and technologies is used to uncover potential patterns of security threats. For instance, Apache Hive provides a SQL-like query interface for structured querying and statistical analysis of processed data, revealing patterns and trends hidden within vast amounts of data. Apache Pig offers a high-level data flow language that simplifies the writing and execution of large-scale data processing scripts. Crucially, Spark’s built-in MLlib library and mature tools like Mahout are employed for deep learning and ML modeling of pre-processed data, yielding precise threat detection and prediction models that enhance the recognition and prevention of network security threats. Through the BDT framework, enterprises can achieve efficient processing and analysis of massive data, enabling real-time monitoring, threat warning, and rapid response in ISM.

The architecture of the ISM model

To explore in depth how the BDT framework supports EISM, the following is a detailed analysis of the ISM model architecture. This model architecture employs BDT to improve the IS level of enterprises, mainly through real-time monitoring, abnormal behavior detection, and risk prediction. The specific model architecture is presented in Fig. 2. In the ISM model architecture, risk identification and feature extraction are the core links, which run through the data and processing layers. They can provide strong support for subsequent security management through the integration and intelligent analysis of multi-source data.

In Fig. 2, a closed loop is formed between the layers through data flow and feedback: the upper layers depend on the data and analysis results from the lower layers to drive the entire security management process.

This study utilizes four data sources to support threat detection: system logs, network traffic records, user behavior data, and external threat intelligence. (1) System logs are generated by operating systems, applications, databases, and similar components. They are used to monitor and analyze system activities in real time and to identify potential anomalous behaviors and security vulnerabilities; they can provide early signs of attacks, such as illegal login attempts and unauthorized access. (2) Network traffic records capture incoming and outgoing data packets through real-time monitoring of network traffic. Special attention is paid to abnormal traffic and data transmission patterns in order to identify potential DDoS attacks, network scanning, and other malicious activities. Through feature analysis and behavior recognition, traffic records assist in detecting security threats at the network layer. (3) User behavior data, such as login activities, access permission changes, and file operation records, are used to analyze account anomalies. These data help identify abuse of permissions, abnormal logins, and similar behaviors, and behavioral analysis models use them to determine whether potential internal threats exist. (4) External threat intelligence provides information on the latest security vulnerabilities, malicious attack activities, and their patterns, aiding cross-domain threat detection. Integrating this intelligence with internal system data for comprehensive threat analysis enhances the accuracy and timeliness of detection. These multidimensional data are converged through a distributed collection framework, achieving comprehensive awareness of the system’s operational status and ensuring the timeliness and integrity of the data.

In the data preprocessing sub-layer of the processing layer, the initial step involves cleansing and optimizing raw data to ensure its suitability for subsequent analysis requirements. Data preprocessing includes multiple stages, beginning with filtering the data to remove redundant information unrelated to security events, thereby reducing the interference of noise on the model’s analytical results. Subsequently, denoising techniques are applied to eliminate unnecessary fluctuations in the data, ensuring stable data quality and enhancing the accuracy of subsequent analysis. Additionally, the preprocessing process involves data format conversion and standardization to better integrate data from different sources. Standardized data can be unified into specific structures or formats, making it more suitable for use in various models and algorithms, ensuring the compatibility of multi-source data.
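The filtering and standardization steps described above can be sketched as follows. The two source schemas (`ts`/`evt`/`usr` and `time`/`event`/`user`), the set of security-relevant event names, and the function names are hypothetical; the sketch only shows the general shape of mapping multi-source records onto one schema while dropping noise and duplicates.

```python
from datetime import datetime, timezone

# Assumed set of event types considered security-relevant.
SECURITY_EVENTS = {"login_failed", "login_success", "permission_change"}


def normalize(record):
    """Map a raw record from either hypothetical source onto one schema.

    Returns None when the record is unrelated to security events.
    """
    if "ts" in record:  # source A: epoch seconds, abbreviated field names
        when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
        event, user = record["evt"], record["usr"]
    else:  # source B: ISO 8601 timestamp string, full field names
        when = datetime.fromisoformat(record["time"])
        event, user = record["event"], record["user"]
    event = event.strip().lower()
    if event not in SECURITY_EVENTS:
        return None  # filtering: drop noise such as heartbeats
    return {"time": when.isoformat(), "event": event, "user": user.strip().lower()}


def preprocess(records):
    """Normalize every record, then drop irrelevant rows and exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        row = normalize(record)
        if row is None:
            continue
        key = (row["time"], row["event"], row["user"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


RAW = [
    {"ts": 1704067200, "evt": "LOGIN_FAILED ", "usr": "Alice"},
    {"time": "2024-01-01T00:00:05+00:00", "event": "login_success", "user": "bob"},
    {"ts": 1704067210, "evt": "HEARTBEAT", "usr": "system"},
    {"ts": 1704067200, "evt": "LOGIN_FAILED ", "usr": "Alice"},  # duplicate
]
cleaned = preprocess(RAW)
```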

After preprocessing, the data enters the feature extraction sub-layer. At this stage, the system extracts features from the processed data that are meaningful for the detection of security threats. These features include, but are not limited to, anomalies in user login patterns, changes in access frequency, and anomalies in resource access patterns. These features can reveal potential threats in system operations. For instance, unusual login times or locations may indicate misuse of user accounts, and frequent access requests may be signs of system scanning or penetration testing. Feature extraction helps enhance the model’s precision in identifying potential threats by recognizing high-value security information from large volumes of raw data.
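As an illustration of this sub-layer, the sketch below derives three per-user features from login events: the share of failed logins, the share of logins outside working hours, and the number of distinct source addresses. The event schema, the 08:00 to 18:00 working window, and the feature names are assumptions chosen for the example rather than the features used in the study.

```python
from collections import defaultdict
from datetime import datetime


def extract_features(events, work_start=8, work_end=18):
    """Aggregate login events into per-user features (hypothetical schema)."""
    stats = defaultdict(lambda: {"total": 0, "failed": 0, "off_hours": 0, "ips": set()})
    for event in events:
        s = stats[event["user"]]
        hour = datetime.fromisoformat(event["time"]).hour
        s["total"] += 1
        s["failed"] += int(event["event"] == "login_failed")
        s["off_hours"] += int(not (work_start <= hour < work_end))
        s["ips"].add(event["ip"])
    return {
        user: {
            "failed_ratio": s["failed"] / s["total"],
            "off_hours_ratio": s["off_hours"] / s["total"],
            "distinct_ips": len(s["ips"]),
        }
        for user, s in stats.items()
    }


EVENTS = [
    {"user": "alice", "time": "2024-01-01T09:15:00", "event": "login_success", "ip": "10.0.0.5"},
    {"user": "alice", "time": "2024-01-01T10:30:00", "event": "login_success", "ip": "10.0.0.5"},
    {"user": "bob", "time": "2024-01-01T02:10:00", "event": "login_failed", "ip": "10.0.0.8"},
    {"user": "bob", "time": "2024-01-01T02:11:00", "event": "login_failed", "ip": "10.0.0.9"},
    {"user": "bob", "time": "2024-01-01T03:00:00", "event": "login_failed", "ip": "10.0.0.8"},
    {"user": "bob", "time": "2024-01-01T14:00:00", "event": "login_success", "ip": "10.0.0.8"},
]
features = extract_features(EVENTS)
```

In this toy data, the second user stands out on all three features, which is the kind of signal a downstream classifier would consume.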

Once feature extraction is complete, the process moves to the model training and evaluation sub-layer. In this sub-layer, ML and data mining techniques are utilized to construct a security threat detection model. Two ML algorithms, Support Vector Machine (SVM) and Random Forest (RF), are employed. SVM is a supervised learning algorithm particularly adept at classifying high-dimensional data, and it serves as the primary classifier in this study. Its fundamental concept is to find an optimal hyperplane (or, in non-linear cases, to map data into a high-dimensional space using kernel functions) that effectively distinguishes between normal and abnormal behaviors (i.e., potential security threats). The advantage of SVM lies in its ability to handle high-dimensional data and its strong generalization, making it especially suitable when the feature dimension is high. During the training phase, SVM uses historical data (such as system logs and network traffic records) to learn the characteristics of security events and build a classification model. The trained SVM model can then promptly identify potential security threats in new input data.

RF is an ensemble learning method that builds multiple decision trees and determines the final classification result through a voting mechanism. RF is used in this study to improve the model’s robustness and to select the most discriminative features among the many security event features. RF can effectively handle large datasets, especially those with noisy or redundant features, and reduces overfitting by combining the results of multiple trees. Through repeated random sampling of training data and random feature selection, RF can extract information useful for threat identification from high-dimensional, complex security event data.
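The voting mechanism alone can be sketched in a few lines. This is deliberately not a full random forest: the three "trees" are hand-written single-split rules with hypothetical thresholds, whereas a real RF would learn each tree from a bootstrap sample with random feature subsets. Only the final majority vote is illustrated.

```python
from collections import Counter


# Hand-written stand-ins for trained trees; thresholds are hypothetical.
def tree_failed_logins(x):
    return "threat" if x["failed_logins"] > 5 else "normal"


def tree_off_hours(x):
    return "threat" if x["off_hours_ratio"] > 0.5 else "normal"


def tree_distinct_ips(x):
    return "threat" if x["distinct_ips"] > 3 else "normal"


TREES = [tree_failed_logins, tree_off_hours, tree_distinct_ips]


def forest_predict(x):
    """Return the class predicted by the majority of trees."""
    votes = Counter(tree(x) for tree in TREES)
    return votes.most_common(1)[0][0]


suspicious = {"failed_logins": 8, "off_hours_ratio": 0.2, "distinct_ips": 4}
ordinary = {"failed_logins": 1, "off_hours_ratio": 0.9, "distinct_ips": 1}
labels = (forest_predict(suspicious), forest_predict(ordinary))
```

With an odd number of trees and two classes, the vote cannot tie; a single tree that misfires on a noisy feature (as `tree_off_hours` does for the second sample) is outvoted by the others.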

In the model training and evaluation process, historical data (such as system logs, user behavior data, and network traffic records) are first used to train the two models. The data preprocessing and feature extraction sub-layers ensure the quality and validity of the input data, after which the threat detection model is trained using the SVM and RF algorithms. The training process iteratively tunes the models’ hyperparameters (e.g., the kernel type of SVM and the number of trees in RF) so that the model can adapt to different security threat scenarios. In the model evaluation phase, several performance indicators (such as accuracy, recall, F1 score, and Area Under the Curve (AUC)) are employed to evaluate the model’s detection effectiveness. For example, comparing the performance of the SVM and RF models on different datasets reveals each algorithm’s advantages and disadvantages in specific scenarios, which guides adjustment and optimization of the model. Through this continuous optimization of algorithms and parameters, the model gradually improves its ability to detect security threats and to respond promptly and accurately to new and complex security events.
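The evaluation indicators named above can be computed directly from labels and model scores. The following sketch uses plain Python; the 0.5 decision threshold and the toy labels and scores are assumptions for illustration, and AUC is computed as the fraction of (positive, negative) pairs that the scores rank correctly, counting ties as half.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall and F1 for binary labels (1 = threat)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores):
    """Pairwise AUC; assumes at least one positive and one negative sample."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: labels and model scores are made up for the example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.6, 0.1, 0.7, 0.3]
report = classification_metrics(y_true, scores)
report["auc"] = auc(y_true, scores)
```

Running the same functions on the outputs of two models gives a like-for-like comparison across datasets; AUC is independent of the chosen threshold, while the other four indicators are not.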

By integrating both SVM and RF algorithms, this study not only achieves stable performance across various security threat detection scenarios but also effectively addresses the evolving security threats. The results of model training and evaluation are passed to the decision layer, where they enter the real-time monitoring subsystem for immediate analysis. Within this subsystem, the system conducts statistical analysis of the output results from the processing layer to monitor anomalies in networks and systems in real-time. The intelligent alert engine can detect and trigger security threat alarms promptly based on predefined thresholds. Once abnormal behavior or attack activities are detected, the system immediately sends alerts to security administrators to take corresponding defensive measures. During this process, the system can also dynamically adjust security policies. For example, based on historical analysis results and real-time security intelligence, the system can automatically adjust firewall rules, access control lists, etc., to counter current or potential threats. In addition to these preset security measures, the system enhances overall defense capabilities through optimized security policies.
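A threshold-based alert step of the kind described here might look like the following sketch. The severity levels, the threshold values, and the assumption that the processing layer emits an anomaly score between 0 and 1 per source are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical predefined thresholds, checked from most to least severe.
THRESHOLDS = [(0.9, "critical"), (0.7, "high"), (0.5, "medium")]


@dataclass
class Alert:
    source: str
    score: float
    severity: str


def evaluate(source, score):
    """Return an Alert if the score crosses any threshold, otherwise None."""
    for limit, severity in THRESHOLDS:
        if score >= limit:
            return Alert(source, score, severity)
    return None


def run_engine(scored_events):
    """Evaluate all scored events and return alerts, most severe first."""
    alerts = []
    for source, score in scored_events:
        alert = evaluate(source, score)
        if alert is not None:
            alerts.append(alert)
    return sorted(alerts, key=lambda a: a.score, reverse=True)


alerts = run_engine([("10.0.0.9", 0.71), ("10.0.0.7", 0.42), ("10.0.0.5", 0.95)])
```

Keeping the thresholds in a single table makes the dynamic policy adjustment mentioned above a matter of updating data rather than code.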

Furthermore, the decision layer includes an emergency response mechanism to handle security incidents that occur. Once a security threat is confirmed, the system can automatically initiate response procedures based on the type and severity of the security event. These procedures include isolating affected systems, implementing repair measures, and even tracking attack paths to determine the source of the attack and its propagation methods. This mechanism can be activated automatically or manually, ensuring that in the event of a security incident, it can be handled swiftly and effectively, minimizing potential losses and impacts. Through this dynamic response mechanism, enterprises can maintain system security and stability in the face of changing security threats, ensuring timely responses to various complex security challenges. Through the organic combination and seamless connection of these links, the system can realize real-time response, efficient defense, and monitoring of IS threats. Thus, it can enhance overall security management capabilities and efficiency in dealing with complex threats.

Ultimately, the visualization and interaction layers are responsible for presenting the final results. The visualization display subsystem uses forms such as data dashboards, heatmaps, and network topologies to transform complex analysis results into a visual interface that is easy to understand and operate. The interaction operation subsystem allows security administrators to customize queries, filter, and drill down for in-depth analysis through the interface, and formulate, execute, and adjust security protection strategies based on visual results. Through this model architecture, enterprises can more effectively utilize BDT to enhance the efficiency and effectiveness of ISM, enabling rapid identification, accurate assessment, and effective response to IS threats.

Threat prediction and management optimization of Big Data-driven EIS

This study adopts a diversified strategy for enterprise management optimization, integrating quantitative and qualitative research paradigms to comprehensively explore and analyze the practical effectiveness and challenges of BDT in EISM. The quantitative analysis applies statistical principles and ML algorithms to deeply mine large-scale security-related data. By constructing a security threat prediction model, specifically a logistic regression model^47,48,49, this study predicts the probability of security events occurring. The equation for the prediction model is as follows:

$$y=f(x;\theta)\tag{1}$$

y represents the probability of the predicted occurrence of a security event. The feature vector x includes a series of attributes related to security threats, such as user login behavior, network traffic features, system state parameters, etc. θ is the learning parameter of the model, obtained through the training process to achieve the optimal solution, reflecting the mapping relationship between features and predicted results. The function f is the ML model’s specific expression. The ML model’s training process is illustrated in Fig. 3.
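For a logistic regression model, f in Eq. (1) takes the standard form of a sigmoid applied to a linear combination of the features, y = 1 / (1 + exp(-(θ₀ + θᵀx))). The sketch below fits θ by plain batch gradient descent on the log-loss. The two features, the toy training data, the learning rate, and the epoch count are assumptions for illustration, not the study's configuration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, theta):
    """y = f(x; theta): theta[0] is the bias, theta[1:] are feature weights."""
    z = theta[0] + sum(w * xi for w, xi in zip(theta[1:], x))
    return sigmoid(z)


def fit(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss."""
    theta = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(xi, theta) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta


# Hypothetical features per sample: [failed-login ratio, normalized traffic volume]
X = [[0.0, 0.1], [0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7], [1.0, 0.8]]
y = [0, 0, 0, 1, 1, 1]  # 1 = a security event occurred
theta = fit(X, y)
p_high = predict_proba([0.95, 0.9], theta)
p_low = predict_proba([0.05, 0.1], theta)
```

After training, `p_high` and `p_low` are the predicted event probabilities for a high-risk and a low-risk feature vector, respectively.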

Qualitative analysis focuses on gaining a deep understanding of the causes, mechanisms, and contexts behind phenomena. In this study, industry experts and frontline security management personnel are invited for in-depth interviews to understand how they apply BDT in their practical work, the challenges they face, and their evaluations of existing solutions, along with improvement suggestions. In selecting experts, this study pays special attention to industry background and practical experience, applying the following criteria. First, senior security executives with extensive experience in the ISM field are selected, including security architects, system security engineers, and heads of IS departments with more than a decade of industry experience. Second, academic experts who have made outstanding contributions to the research and development of security technology are invited; these experts have conducted in-depth research on the application of BDT and the construction of security management systems. To ensure that the selected experts can provide representative feedback, all experts come from industries with highly complex security requirements, including finance, manufacturing, and information technology. This diverse selection ensures a comprehensive understanding of the application needs and challenges of BDT in different industries.

To gather the experts’ insights, a semi-structured interview questionnaire is designed, and the interview content revolves around the following aspects. First, experts evaluate the application of BDT in practical ISM and explain how they use BDT for security threat detection, risk assessment, and decision support in their work. Second, experts share specific challenges they have encountered in applying BDT, such as the complexity of data processing, the difficulty of real-time analysis, and compatibility issues with existing security management systems. Finally, the experts propose suggestions for improving existing BDT solutions, covering how to improve the efficiency of data processing, enhance the predictive ability of models, and customize solutions for different industries. To ensure the depth and breadth of the interviews, each interview is set to about one hour, which gives the experts room to express their views freely and to provide specific cases and personal experience from real work.

Additionally, in terms of case analysis, representative enterprises from various industry sectors are selected for detailed study. The selected cases span multiple domains, including manufacturing, financial services, and information technology, where enterprises have high demand for and practical experience in ISM. By analyzing these cases, the study examines how the enterprises apply BDT in actual operations. It also focuses on the difficulties encountered, the resolution strategies, and the outcomes of the implementation process, thereby distilling general security management strategies applicable to a wide range of industries. The expert interviews and the enterprise case analysis not only provide practical insights for this study but also help further optimize the BDT model so that it better meets the security management needs of different fields. First, the experts’ suggestions help identify practical obstacles to applying BDT in the security management process, providing valuable references for model optimization and subsequent analysis. Second, the improvements suggested by the experts play a direct role in adjusting and expanding the model’s functions, ensuring that the research outcomes are closely aligned with industry demands.
