Automatic bleeding detection in laparoscopic surgery based on a faster region-based convolutional neural network
Introduction
As one of the major advances in the medical field, endoscopic technology has been widely and deeply applied in surgical operations, but it has also presented several challenges, including unwanted vessel injury, a reduced field of vision, and decreased maneuverability. Bleeding is a common but nonnegligible complication of operations, especially laparoscopic surgeries, and can cause serious consequences or even become life-threatening without timely treatment (1). Because the operative field seen through the laparoscopic monitor is always narrow and blood quickly floods the bleeding point, it is difficult to stanch bleeding in a timely manner during laparoscopic procedures. Figure 1 shows the 2 common bleeding types, arterial bleeding and venous bleeding. Generally, arterial bleeding is more difficult to stanch than venous bleeding, and a blood pool will form immediately if arterial bleeding is not swiftly stanched (Figure 1C). Thus, instant and exact detection of the bleeding point is critical to the success of surgery.
In the past years, some efforts have been made to expedite blood and bleeding detection in endoscopic operations. Garcia-Martinez et al. proposed a computer vision algorithm to automatically detect hemorrhages using the blue/red (B/R) and green/red (G/R) parameters of the RGB space, but its accuracy was affected by poor illumination, visual interferences, and sudden camera movement (2). Okamoto et al. identified blood regions by classifying pixels as blood or non-blood with a support vector machine (SVM), using color features such as a combination of RGB and hue, saturation, value (HSV) values (3). Fu et al. proposed a method based on color features and a neural network to detect bleeding regions, and later grouped pixels through superpixel segmentation, extracted superpixel features, and fed them into an SVM for classification (4,5). Jia et al. extracted features automatically via a deep convolutional neural network (CNN) and subsequently classified images as bleeding or non-bleeding (6). To avoid unwanted vessel resection, Casella et al. also attempted to segment kidney vessels in nephrectomy laparoscopic vision based on adversarial fully convolutional neural networks (FCNNs) (7).
The above studies have shown the key role of feature extraction in experimental design, but most of them were limited to extracting and analyzing information from a single image, which might restrict their further application to real-time videos. The major aim of our study was to provide a hemostasis support system for laparoscopic videos that could automatically indicate and keep track of the location of the bleeding point once bleeding occurs. Our work was divided into 2 steps: detecting bleeding points and tracking them.
Challenges arose at the very beginning of the first step. In the spatial dimension, blood itself could be clearly defined, but the event of bleeding could not: we might be able to judge whether blood was present from a single image, but not whether there was an active bleeding event. We therefore inferred that information from the temporal dimension had to be considered as well. Looking backward (to the previous frame) or forward (to the next frame) was needed to assess whether the blood area had changed; in general, active bleeding is characterized by an increasing area of blood. Thus, spatial and temporal features had to be analyzed simultaneously. In this study, we introduced optical flow to extract temporal features and the faster region-based convolutional neural network (RCNN) to extract spatial features, and consequently developed a spatiotemporal hybrid model based on the faster RCNN to locate bleeding points (8). We present the following article in accordance with the MDAR reporting checklist (available at https://atm.amegroups.com/article/view/10.21037/atm-22-1914/rc).
Methods
Dataset
To achieve better performance and avoid overfitting, a large quantity of data was needed to train the deep learning model. We selected 12 bleeding video clips from 10 laparoscopic surgeries conducted at Peking Union Medical College Hospital, and sample images were extracted from the clips at 25 frames per second (fps). In total, we obtained 2,665 images of 1,920×1,080 pixels, 1,339 of which contained bleeding events. The ground-truth areas of the bleeding point were marked by 2 senior surgeons using the visual object tagging tool (VoTT; Microsoft, Redmond, WA, USA) (9), and the center of each marked box was regarded as the bleeding point. A more accurate point between the 2 marked centers was chosen as the picked point, and a 400×400-pixel box centered on the picked point was generated as the ground truth.
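A minimal sketch of how such a 400×400 ground-truth box can be generated from the picked point is shown below (the clipping to the image bounds and the function name are our assumptions, not details stated above):

```python
import numpy as np

def ground_truth_box(picked_point, img_w=1920, img_h=1080, box_size=400):
    """Build a box_size x box_size ground-truth box centered on the picked
    bleeding point, clipped so it stays inside the image (our assumption)."""
    cx, cy = picked_point
    half = box_size // 2
    x1 = int(np.clip(cx - half, 0, img_w - box_size))
    y1 = int(np.clip(cy - half, 0, img_h - box_size))
    return x1, y1, x1 + box_size, y1 + box_size  # (x_min, y_min, x_max, y_max)

# A picked point near the right edge still yields a full 400x400 box in frame.
print(ground_truth_box((1800, 540)))  # (1520, 340, 1920, 740)
```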
Preprocessing
Data augmentation
To obtain more data for deep learning training, data augmentation was performed on our dataset using 2 methods. First, gamma correction was used to change image brightness nonlinearly. Because adjacent frames are largely redundant, we applied gamma values of 0.2, 0.4, ..., 1.6, 1.8 to the frames in turn; this expanded the original data while keeping the model training speed constant. Second, the images were randomly flipped horizontally, vertically, or both. Figure 2 shows sample results of the data augmentation.
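The 2 augmentation steps can be sketched as follows (the gamma-correction convention, the OpenCV usage, and the file name are our assumptions):

```python
import random
import cv2
import numpy as np

def gamma_correct(img, gamma):
    """Nonlinear brightness change: out = 255 * (in / 255) ** gamma."""
    table = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(img, table)

def random_flip(img):
    """Randomly flip horizontally, vertically, both, or not at all."""
    mode = random.choice([None, 1, 0, -1])  # cv2.flip: 1=horizontal, 0=vertical, -1=both
    return img if mode is None else cv2.flip(img, mode)

gammas = [round(0.2 * i, 1) for i in range(1, 10)]   # 0.2, 0.4, ..., 1.8
frame = cv2.imread("frame_0001.png")                 # hypothetical frame file
augmented = [random_flip(gamma_correct(frame, g)) for g in gammas]
```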
Optical flow
Optical flow is the instantaneous velocity of pixel motion of a moving object on the viewing image plane. Methods based on this concept find the correspondence between the previous frame and the current frame from pixel changes in the time domain and from the correlation between adjacent frames in the image sequence, and thereby compute the motion of objects between adjacent frames.
There are many ways to obtain an optical flow map. FlowNet treats optical flow estimation as a supervised learning problem, which means that ground-truth optical flow data are required; the large, synthetic, and unrealistic Flying Chairs dataset was generated to provide such data (10). With continuous development, FlowNet-style models have become comparable to traditional methods for optical flow estimation, so we chose this family of networks to integrate with our model. During the training phase, we first used the network to calculate the optical flow maps we needed; during application, we connected it to 1 of the inputs of our model, that is, its output was used as an input to our model. LiteFlowNet was finally used in this study, as it performs on par with FlowNet2 while reducing model size by 30 times and running 1.36 times faster (11,12).
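For illustration, a classical dense optical flow estimator can stand in for LiteFlowNet, whose pretrained weights and interface are not reproduced here; the sketch below uses OpenCV's Farneback method and hypothetical file names:

```python
import cv2

# Read the current frame and a previous frame, converted to grayscale.
prev = cv2.cvtColor(cv2.imread("frame_t_minus_5.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_t.png"), cv2.COLOR_BGR2GRAY)

# Dense flow of shape (H, W, 2): per-pixel displacement (dx, dy) from prev to curr.
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```

In the actual pipeline, LiteFlowNet replaces this step, and its 2-channel output is fed to the model as described in the next subsection.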
Study methods
The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013), and approval was granted by the Ethics Committee of Peking Union Medical College Hospital (No. S-K1902). Informed consent was provided by all study participants. Significant improvements in object detection have been achieved by the RCNN method, which combines region proposals with CNN features in a single image (13). The method first generates region proposals using selective search, rescales them to a fixed size, extracts features with a CNN model, and finally uses these features for category classification and box regression. Fast RCNN improves RCNN by introducing region of interest (ROI) pooling (14), and faster RCNN accelerates it further by replacing selective search with a region proposal network (RPN) (8). The RPN predicts latent object boxes, and the objectness score represents the probability that an object is present in the box at each position.
As shown in Figure 3, our model extends faster RCNN with an optical flow input: it takes as input an RGB frame and an optical flow map estimated from that frame and its adjacent frame. Because the optical flow map has only 2 channels and the subsequent feature extractor accepts 3 channels of input, we transformed the optical flow map to 3 channels by stacking the x-component, the y-component, and the magnitude of the flow (15). An RGB feature extractor and an optical flow feature extractor were then set up; for both, we fine-tuned the VGG-16 model pretrained on the ImageNet dataset and retained the first 30 layers for further use (16,17). After feature extraction, we stacked the RGB feature map and the optical flow feature map and applied a 1×1 convolution that keeps the original height and width constant while halving the number of channels. The RPN operated on the fused convolutional feature maps: several rectangular anchors of different sizes were defined at each location of the fused feature map, and after a convolutional layer, the anchor features were fed to 2 fully connected layers, 1 for object classification and the other for coordinate prediction. Finally, the resulting proposals were applied to the fused convolutional feature maps, and after ROI pooling and the classifier, the bleeding points were obtained. The classifier is identical to that of the RCNN; the difference is that we focused only on bleeding points, so the classifier further corrected the detection results and can be seen as a refinement of the RPN output in this case. We adopted the same loss function as faster RCNN to optimize the model parameters.
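The channel transformation and feature fusion described above can be sketched in PyTorch as follows (the exact truncation point of VGG-16, the channel counts, and the input sizes are our assumptions based on the description):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def flow_to_3ch(flow):
    """(N, 2, H, W) flow -> (N, 3, H, W) by stacking x, y, and magnitude."""
    mag = torch.sqrt(flow[:, 0:1] ** 2 + flow[:, 1:2] ** 2)
    return torch.cat([flow, mag], dim=1)

# Two VGG-16 backbones truncated to the first 30 layers of `features`.
rgb_extractor = vgg16(weights="IMAGENET1K_V1").features[:30]
flow_extractor = vgg16(weights="IMAGENET1K_V1").features[:30]

# A 1x1 convolution halves the stacked channels (512 + 512 -> 512) while
# leaving the spatial size of the feature maps unchanged.
fuse = nn.Conv2d(1024, 512, kernel_size=1)

rgb = torch.randn(1, 3, 608, 1088)    # resized RGB frame (size illustrative)
flow = torch.randn(1, 2, 608, 1088)   # 2-channel optical flow map
fused = fuse(torch.cat([rgb_extractor(rgb), flow_extractor(flow_to_3ch(flow))], dim=1))
# `fused` would then feed the RPN and ROI head of a faster RCNN-style detector.
```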
Statistical analysis
To assess the performance, we introduced precision rate (PR), recall rate (RR), and average precision (AP) as follows:
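Written out from the definitions given below, the 2 rates take the standard form:

```latex
\mathrm{PR} = \frac{T_p}{T_p + F_p}, \qquad \mathrm{RR} = \frac{T_p}{T_p + F_n}
```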
where Tp and Fp are the numbers of true positives and false positives, respectively. A predicted bounding box was identified as a true positive if the intersection over union (IOU) between it and the ground-truth bounding box was greater than 0.5; all other predicted bounding boxes were identified as false positives. The PR reflected the ability of our model to identify the bleeding points.
Here, Fn is the number of false negatives, and Tp + Fn is the total number of samples labeled as bleeding points (all ground-truth bounding boxes). The RR reflected the ability of our model to find all the bleeding points.
The area under the curve (AUC) of the precision-recall curve is another way to compare the performance of object detectors. We used the newer PASCAL VOC challenge method of computing AP, which averages precision across all recall values between 0 and 1 (18).
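One common way to write this all-points interpolation (our rendering of the PASCAL VOC definition, with p(r) denoting the precision at recall r) is:

```latex
\mathrm{AP} = \sum_{n} \left( r_{n+1} - r_{n} \right) p_{\mathrm{interp}}(r_{n+1}),
\qquad p_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})
```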
The task of a video detector is to find where the target object first appears; a tracker then follows it from there. Unlike static images, video frames may suffer from motion blur caused by camera movement. Because running the detector on every video frame is inefficient given the information redundancy between adjacent frames, we used only the detection results of the first 5 frames in which the target appeared.
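A minimal sketch of this detect-then-track handoff is shown below; `detect_bleeding` is a hypothetical wrapper around our trained model, the clip name is illustrative, and the OpenCV CSRT tracker (from opencv-contrib-python) stands in for the tracking step, which this work largely leaves to future research:

```python
import cv2

def detect_bleeding(frame):
    """Hypothetical wrapper around the trained spatiotemporal faster RCNN;
    returns an (x, y, w, h) box, or None when no bleeding point is detected."""
    raise NotImplementedError

cap = cv2.VideoCapture("bleeding_clip.mp4")   # hypothetical input clip
tracker, detections = None, []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if tracker is None:
        box = detect_bleeding(frame)
        if box is not None:
            detections.append(box)
        if len(detections) == 5:              # detector used only on the first 5 hits
            tracker = cv2.TrackerCSRT_create()
            tracker.init(frame, detections[-1])
    else:
        found, box = tracker.update(frame)    # follow the bleeding point afterwards
cap.release()
```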
Results
Ablation experiment
Optical flow parameter selection
Optical flow was estimated from 2 images, and in a video sequence, different frame gaps correspond to different downsampling rates. To investigate the effect of the downsampling rate on performance, we estimated the optical flow of the current frame against the frame 1, 3, 5, or 10 frames earlier. Because our data were sampled from the videos at 25 fps, a gap of 5 frames corresponds to 5 fps, and likewise for the other gaps. As shown in Figure 4, the best results were obtained when we sampled at 5 fps, and optical flow calculations in the following experiments were all based on this sampling.
Effects of different model components
To determine how each component of the model affects performance, we conducted an ablation study. First, to examine the effect of using only 1 feature extractor input, we used the original faster RCNN structure to train a model with RGB image input and a model with optical flow map input. Second, we trained our combined model on the same data.
Figure 5 shows the effect of the different model components, with the area under the precision-recall curve taken as the AP. A single RGB input or a single optical flow input resulted in low AP, whereas combining RGB and optical flow led to significant improvements. As shown in Table 1, the model combining RGB and optical flow performed well on all 3 indicators (AP, RR, PR). We set the threshold to 0.7 when calculating RR and PR, showing that the recall rate remained good under high-precision conditions.
Table 1
Model | AP | RR (threshold =0.7) | PR (threshold =0.7) |
---|---|---|---|
RGB | 0.3546 | 0.4477 | 0.7228 |
Optical flow | 0.2456 | 0.2802 | 0.2802 |
RGB + optical flow | 0.6818 | 0.8034 | 0.8373 |
AP, average precision; RR, recall rate; PR, precision rate; RGB, red-green-blue.
Detection results
To evaluate detection results in video sequences more objectively, the IOU, defined as the overlap ratio between the predicted and ground-truth boxes, was applied, with the threshold set to 0.5. As shown in Figure 6, most frame IOUs were over 0.5 and around 0.7, indicating good detection results. Interested readers can find the corresponding videos in a supplementary appendix online; Videos 1,2 show how our method performed in bleeding point detection.
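For reference, the IOU between 2 boxes can be computed as in the short sketch below (the box format and example values are ours):

```python
def iou(box_a, box_b):
    """Intersection over union of 2 boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# Two 400x400 boxes shifted by (50, 20) pixels overlap well above the 0.5 threshold.
print(iou((100, 100, 500, 500), (150, 120, 550, 520)))  # ~0.71
```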
As shown in Figure 7, the prediction box is drawn in green and the ground-truth box in red. The detection images of the first 5 frames intuitively demonstrate the high detection accuracy of our model, which is also confirmed by the indicators in Table 2. These results suggested that our method met the requirements of a video detector.
Table 2
Model | AP | RR (threshold =0.7) | PR (threshold =0.7) |
---|---|---|---|
First 5 | 1 | 1 | 1 |
All | 0.6818 | 0.8034 | 0.8373 |
AP, average precision; RR, recall rate; PR, precision rate.
Discussion
The RCNN series of methods mainly includes RCNN, SPP-NET, fast RCNN, and faster RCNN. The RCNN method uses a CNN to extract image features, improving the ability of the features to represent samples, and uses supervised pretraining on large datasets supplemented by fine-tuning on small samples to address the difficulty of training with limited data. The main improvement of SPP-NET is extracting ROI features directly on the feature maps, which greatly improves processing efficiency. Fast RCNN uses a softmax classifier instead of the SVM in RCNN and adds bounding box regression to the network to achieve end-to-end training. Faster RCNN proposed the RPN, which yields a complete end-to-end CNN object detection model.

In this study, we developed a deep learning-based method for bleeding detection in medical videos. The biggest advantage of our method is the extraction and application of spatiotemporal features by using RGB frames and optical flow maps as input, which significantly improved the detection of bleeding points. Most previous work used a single RGB information source to determine the blood region in an image, but it is difficult to determine whether there is a bleeding event on this basis. To address this issue, we introduced temporal information extracted from the optical flow map to directly locate the bleeding point. According to our literature search, we are the first team to have successfully detected bleeding points in video, and as rapid localization of bleeding points and hemostasis can reduce unnecessary blood loss and operation time, this work is meaningful for surgeons.

In clinics, venous bleeding is distinct from arterial bleeding. First, venous bleeding is much slower than arterial bleeding. Second, venous hemorrhage manifests as the diffuse spreading of a blood pool, whereas in arterial hemorrhage there are always distinct small blood columns. Our results showed that our model could deal with both conditions without an explicit distinction between the 2 bleeding types.
Our work also had some limitations. The first is false positives. As shown in Figure 8, our method not only predicted an accurate box, but also predicted a box that deviated from the ground-truth box; in this case, the blood column bounced off the overlying tissue, misleading our system into tracking it as a secondary bleeding point. In the future, we will further train the optical flow network with medical scene data to make it more suitable for our tasks. Second, our method also failed to detect bleeding in some cases. In Figure 9, we show a successful and a failed detection side by side. In the successful case, the optical flow map clearly showed the blood flow, but in the failed case, the optical flow map was cluttered and scarcely any meaningful information could be extracted. Possible reasons for this puzzling optical flow map include camera movement and the wide overall field of view, especially when the position and angle of the lens change greatly. Optical flow is a two-dimensional concept, and the failed detection might also have resulted from image scaling.
In the future, we will try to use tracking techniques to predict the box in the next frame, further improve the optical flow map, and apply it to bleeding detection.
Conclusions
Our study introduces a novel bleeding detection method for laparoscopic surgery that takes full advantage of the faster RCNN and optical flow, demonstrating its value in improving surgical safety.
Acknowledgments
Funding: This work was supported by the Fundamental Research Funds for the Central Universities (No. 3332019020), and the Tsinghua University-Peking Union Medical College Hospital Initiative Scientific Research Program (No. PTQH201911015).
Footnote
Reporting Checklist: The authors have completed the MDAR reporting checklist. Available at https://atm.amegroups.com/article/view/10.21037/atm-22-1914/rc
Data Sharing Statement: Available at https://atm.amegroups.com/article/view/10.21037/atm-22-1914/dss
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://atm.amegroups.com/article/view/10.21037/atm-22-1914/coif). JW is from Hangzhou Hikvision Digital Technology Co. Ltd. GH is from Hangzhou Hikimaging Technology Co. Ltd. The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was performed in line with the principles of the Declaration of Helsinki (as revised in 2013), informed consent was provided by all study participants, and approval was granted by the Ethics Committee of Peking Union Medical College Hospital (No. S-K1902).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Kaushik R. Bleeding complications in laparoscopic cholecystectomy: Incidence, mechanisms, prevention and management. J Minim Access Surg 2010;6:59-65. [Crossref] [PubMed]
- Garcia-Martinez A, Vicente-Samper JM, Sabater-Navarro JM. Automatic detection of surgical haemorrhage using computer vision. Artif Intell Med 2017;78:55-60. [Crossref] [PubMed]
- Okamoto T, Ohnishi T, Kawahira H, et al. Real-time identification of blood regions for hemostasis support in laparoscopic surgery. Signal Image Video Process 2019;13:405-12. [Crossref]
- Fu Y, Mandal M, Guo G. Bleeding region detection in WCE images based on color features and neural network. Seoul, Korea: 2011 IEEE 54th International Midwest Symposium on Circuits and Systems (MWSCAS), 2011.
- Fu Y, Zhang W, Mandal M, et al. Computer-aided bleeding detection in WCE video. IEEE J Biomed Health Inform 2014;18:636-42. [Crossref] [PubMed]
- Jia X, Meng MQH. A deep convolutional neural network for bleeding detection in wireless capsule endoscopy images. Orlando, FL, USA: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016: 639-642.
- Casella A, Moccia S, Carlini C, et al. NephCNN: A deep-learning framework for vessel segmentation in nephrectomy laparoscopic videos. In 2020 25th International Conference on Pattern Recognition (ICPR), 2021.
- Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017;39:1137-49. [Crossref] [PubMed]
- Available online: https://github.com/Microsoft/VoTT. Accessed 18 October 2020.
- Fischer P, Dosovitskiy A, Ilg E, et al. Flownet: Learning optical flow with convolutional networks. Santiago, Chile: 2015 IEEE International Conference on Computer Vision (ICCV), 2015:2758-66.
- Hui TW, Tang X, Change Loy C. Liteflownet: A lightweight convolutional neural network for optical flow estimation. Salt Lake City, UT, USA: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018:8981-9.
- Ilg E, Mayer N, Saikia T, et al. Flownet 2.0: Evolution of optical flow estimation with deep networks. Honolulu, HI, USA: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017:2462-70.
- Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. Columbus, OH, USA: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- Girshick R. Fast R-CNN. Santiago, Chile: 2015 IEEE International Conference on Computer Vision (ICCV), 2015:1440-8.
- Weinzaepfel P, Harchaoui Z, Schmid C. Learning to track for spatio-temporal action localization. Santiago, Chile: 2015 IEEE International Conference on Computer Vision (ICCV), 2015:3164-72.
- Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge. Int J Comput Vis 2015;115:211-52. [Crossref]
- Everingham M, Winn J. The PASCAL visual object classes challenge 2012 (VOC2012) development kit. Pattern Analysis, Statistical Modelling and Computational Learning.