GANPOP: Generative Adversarial Network Prediction of Optical Properties from Single Snapshot Wide-field Images

Mason T. Chen; Faisal Mahmood; Jordan A. Sweer; and Nicholas J. Durr

arXiv:1906.05360·eess.IV·June 24, 2019

TL;DR

GANPOP is a deep learning framework that accurately estimates tissue optical properties from single wide-field images, outperforming existing methods and working with both structured and flat-field illumination.

Contribution

This work introduces GANPOP, a novel GAN-based method for rapid, single-image optical property estimation applicable to various tissue types and illumination conditions.

Findings

1) GANPOP achieves 58% higher accuracy than SSOP in human gastrointestinal tissues.

2) It estimates optical properties with about 43% improvement over SSOP in swine tissues.

3) GANPOP performs well with flat-field illumination images, reducing the need for structured illumination.

Abstract

We present a deep learning framework for wide-field, content-aware estimation of absorption and scattering coefficients of tissues, called Generative Adversarial Network Prediction of Optical Properties (GANPOP). Spatial frequency domain imaging is used to obtain ground-truth optical properties from in vivo human hands, freshly resected human esophagectomy samples and homogeneous tissue phantoms. Images of objects with either flat-field or structured illumination are paired with registered optical property maps and are used to train conditional generative adversarial networks that estimate optical properties from a single input image. We benchmark this approach by comparing GANPOP to a single-snapshot optical property (SSOP) technique, using a normalized mean absolute error (NMAE) metric. In human gastrointestinal specimens, GANPOP estimates both reduced scattering and absorption…

Tables

Table I. Summary of networks trained in this study.

Network	Input R channel	Input G channel	Output R channel	Output G channel
$N_{1}$	$\frac{I_{AC}}{M_{DC,ref}}$	$\frac{I_{AC}}{M_{AC,ref}}$	$\mu_{a}$	$\mu_{s}^{\prime}$
$N_{2}$	$\frac{I_{AC}}{M_{DC,ref}}$	$\frac{I_{AC}}{M_{AC,ref}}$	$\mu_{a,prof}$	$\mu_{s,prof}^{\prime}$
$N_{3}$	$\frac{I_{DC}}{M_{DC,ref}}$	$\frac{I_{DC}}{M_{AC,ref}}$	$\mu_{a}$	$\mu_{s}^{\prime}$
$N_{4}$	$\frac{I_{DC}}{M_{DC,ref}}$	$\frac{I_{DC}}{M_{AC,ref}}$	$\mu_{a,prof}$	$\mu_{s,prof}^{\prime}$

Table II. Performance comparison of the proposed framework against model-based SSOP and other deep learning architectures when tested on profile-uncorrected data ($N_{1}$). Performance is measured in terms of normalized mean absolute error (NMAE).

Data type	ResNet $\mu_{a}$	ResNet $\mu_{s}^{\prime}$	U-Net $\mu_{a}$	U-Net $\mu_{s}^{\prime}$	ResNet-UNet $\mu_{a}$	ResNet-UNet $\mu_{s}^{\prime}$	ResNet GAN $\mu_{a}$	ResNet GAN $\mu_{s}^{\prime}$	U-Net GAN $\mu_{a}$	U-Net GAN $\mu_{s}^{\prime}$	SSOP $\mu_{a}$	SSOP $\mu_{s}^{\prime}$	Proposed $\mu_{a}$	Proposed $\mu_{s}^{\prime}$
Human esophagus	0.227	0.143	0.161	0.153	0.203	0.140	0.232	0.156	0.176	0.165	0.301	0.290	0.124	0.121
In vivo pig colon	0.614	0.729	0.320	0.486	0.609	0.583	0.795	0.769	0.335	0.377	0.246	0.235	0.139	0.131
Ex vivo pig GI tissue	2.954	0.175	0.344	0.378	2.842	0.177	3.138	0.175	0.574	0.410	0.152	0.106	0.080	0.068
In vivo human hands	0.373	0.100	0.123	0.109	0.249	0.081	0.353	0.106	0.162	0.099	0.092	0.058	0.075	0.055
Overall	1.042	0.287	0.237	0.281	0.976	0.245	1.129	0.301	0.312	0.263	0.198	0.172	0.104	0.094

Equations

M_{AC}(x) = \frac{\sqrt{2}}{3} \cdot \sqrt{(I_{1}(x) - I_{2}(x))^{2} + (I_{2}(x) - I_{3}(x))^{2} + (I_{3}(x) - I_{1}(x))^{2}},

R_{d}(x) = \frac{M_{AC}(x)}{M_{AC,ref}(x)} \cdot R_{d,predicted}.

\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x, y \sim p_{data}(x, y)}[(D(x, y) - 1)^{2}] + \mathbb{E}_{x \sim p_{data}(x)}[D(x, G(x))^{2}],

\mathcal{L}_{1}(G) = \mathbb{E}_{x, y \sim p_{data}(x, y)}[\lVert y - G(x) \rVert_{1}].

\mathcal{L}(G, D) = \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}_{1}(G),

NMAE = \frac{\sum_{i=1}^{T} |p_{i} - p_{i,ref}|}{\sum_{i=1}^{T} p_{i,ref}}.

Full text

GANPOP: Generative Adversarial Network Prediction of Optical Properties from Single Snapshot Wide-field Images

Mason T. Chen, Faisal Mahmood, Jordan A. Sweer, and Nicholas J. Durr

All authors are with the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218. Contact e-mail: [email protected].

Abstract

We present a deep learning framework for wide-field, content-aware estimation of absorption and scattering coefficients of tissues, called Generative Adversarial Network Prediction of Optical Properties (GANPOP). Spatial frequency domain imaging is used to obtain ground-truth optical properties from in vivo human hands, freshly resected human esophagectomy samples and homogeneous tissue phantoms. Images of objects with either flat-field or structured illumination are paired with registered optical property maps and are used to train conditional generative adversarial networks that estimate optical properties from a single input image. We benchmark this approach by comparing GANPOP to a single-snapshot optical property (SSOP) technique, using a normalized mean absolute error (NMAE) metric. In human gastrointestinal specimens, GANPOP estimates both reduced scattering and absorption coefficients at 660 nm from a single 0.2 mm$^{-1}$ spatial frequency illumination image with 58% higher accuracy than SSOP. When applied to both in vivo and ex vivo swine tissues, a GANPOP model trained solely on human specimens and phantoms estimates optical properties with approximately 43% improvement over SSOP, indicating adaptability to sample variety. Moreover, we demonstrate that GANPOP estimates optical properties from flat-field illumination images with similar error to SSOP, which requires structured illumination. Given a training set that appropriately spans the target domain, GANPOP has the potential to enable rapid and accurate wide-field measurements of optical properties, even from conventional imaging systems with flat-field illumination.

Index Terms:

optical imaging, tissue optical properties, neural networks, machine learning, spatial frequency domain imaging

I Introduction

The optical properties of tissues, including the absorption ( $\mu_{a}$ ) and reduced scattering ( $\mu_{s}^{\prime}$ ) coefficients, can be useful clinical biomarkers for measuring trends and detecting abnormalities in tissue metabolism, tissue oxygenation, and cellular proliferation [1, 2, 3, 4, 5]. Optical properties can also be used for contrast in functional or structural imaging [6, 7]. Thus, quantitative imaging of tissue optical properties can facilitate more objective, precise, and optimized management of patients.

To measure optical properties, it is generally necessary to decouple the effects of scattering and absorption, which both influence the measured intensity of remitted light. Separation of these parameters can be achieved with temporally or spatially resolved techniques, which can each be performed with measurements in the real or frequency domains. Spatial Frequency Domain Imaging (SFDI) decouples absorption from scattering by characterizing the tissue modulation transfer function to spatially modulated light [8, 9]. This approach has significant advantages in that it can easily be implemented with a consumer grade camera and projector, and achieve rapid, non-contact mapping of optical properties. These advantages make SFDI well-suited for applications that benefit from wide-field characterization of tissues, such as image-guided surgery [10, 11] and wound characterization [12, 13, 14]. Additionally, recent work has explored the use of SFDI for improving endoscopic procedures [15, 16].

Although SFDI is finding a growing number of clinical applications, there are remaining technical challenges that limit its adoption. First, SFDI requires structured light projection with carefully-controlled working distance and calibration, which is especially challenging in an endoscopic setting. Second, it is difficult to achieve real-time measurements. Conventional SFDI requires a minimum of six images per wavelength (three distinct spatial phases at two spatial frequencies) to generate a single optical property map. A lookup table (LUT) search is then performed for optical property fitting. The recent development of real-time single snapshot imaging of optical properties (SSOP) has reduced the number of images required per wavelength from 6 to 1, considerably shortening acquisition time [17]. However, SSOP introduces image artifacts arising from single-phase projection and frequency filtering, which corrupt the optical property estimations. To reduce barriers to clinical translation, there is a need for optical property mapping approaches that are simultaneously fast and accurate while requiring minimal modifications to conventional camera systems.

Here, we introduce a deep learning framework to predict optical properties directly from single images. Deep networks, especially convolutional neural networks (CNNs), are growing in popularity for medical imaging tasks, including computer-aided detection, segmentation, and image analysis [18, 19, 20]. We pose the optical property estimation challenge as an image-to-image translation task and employ generative adversarial networks (GANs) to efficiently learn a transformation that is robust to input variety. First proposed in [21], GANs have improved upon the performance of CNNs in image generation by including both a generator and a discriminator. The former is trained to produce realistic output, while the latter is tasked to classify generator output as real or fake. The two components are trained simultaneously to outperform each other, and the discriminator is discarded once the generator has been trained. When both components observe the same type of data, such as text labels or input images, the GAN model becomes conditional. Conditional GANs (cGANs) are capable of making structured predictions by incorporating non-local, high-level information. Moreover, because they can automatically learn a loss function instead of using a handcrafted one, cGANs have the potential to be an effective and generalizable solution to various image-to-image translation tasks [22, 23]. In medical imaging, cGANs have proven successful in many applications, such as image synthesis [24], noise reduction [25], and sparse reconstruction [26]. In this study, we train cGAN networks on a series of structured or flat-field illumination images paired with corresponding optical property maps (Fig. 1). We demonstrate that the GANPOP approach produces rapid and accurate estimation from input images from a wide variety of tissues using a relatively small set of training data.

II Related Work

II-A Diffuse reflectance imaging

Optical absorption and reduced scattering coefficients can be measured using temporally or spatially resolved diffuse reflectance imaging. Approaches that rely on point illumination inherently have a limited field of view [27, 28]. Non-contact, hyperspectral imaging techniques measure the attenuation of light at different wavelengths, from which the concentrations of tissue chromophores, such as oxy- and deoxy-hemoglobin, water, and lipids, can be quantified [29]. A recent study has also proposed using a Bayesian framework to infer tissue oxygen concentration by recovering intrinsic multispectral measurements from RGB images [30]. However, these methods fail to unambiguously separate absorption and scattering coefficients, which poses a challenge for precise chromophore measurements. Moreover, accurate determination of both parameters is critical for the detection and diagnosis of diseases [1, 5].

II-B Single snapshot imaging of optical properties

SSOP achieves optical property mapping from a single structured light image. Using Fourier domain filtering, this method separates DC (planar) and AC (spatially modulated) components from a single-phase structured illumination image [17]. A grid pattern can also be applied to simultaneously extract optical properties and three-dimensional profile measurements [31]. When tested on homogeneous tissue-mimicking phantoms, this method is able to recover optical properties within 12% for absorption and 6% for reduced scattering using conventional profilometry-corrected SFDI as ground truth.

II-C Machine learning in optical property estimation

Despite its prevalence and increasing importance in the field of medical imaging, machine learning has only recently been explored for optical property mapping. This includes a random forest regressor to replace the nonlinear model inversion [32], and using deep neural networks to reconstruct optical properties from multifrequency measurements [33]. Both of these approaches aim to bypass the time-consuming LUT step in SFDI. However, they require diffuse reflectance measurements from multiple images to achieve accurate results and consider each pixel independently.

III Contributions

We propose an adversarial framework for learning a content-aware transformation from single illumination images to optical property maps. In this work, we:

1) develop a data-driven model to estimate optical properties directly from input reflectance images;

2) demonstrate advantages of structured versus flat-field light as an input to determine optical properties;

3) perform cross-validated experiments, comparing our technique with model-based SSOP and other deep learning-based methods; and

4) acquire and make publicly available a dataset of registered flat-field-illumination images, structured-illumination images, and ground-truth optical properties of a variety of ex vivo and in vivo tissues.

IV Methods

For training and testing of the GANPOP model, single structured or flat-field illumination images were used, paired with registered optical property maps. To generate ground truth optical properties, conventional six-image SFDI was implemented. GANPOP performance was analyzed and compared to other techniques both in unseen tissues of the same type as the training data (new ex vivo esophagus) and in different tissue types (in vivo and ex vivo swine gastrointestinal tissues).

IV-A Hardware

In this study, all images were captured using a commercially available SFDI system (Reflect RS™, Modulated Imaging Inc.). A schematic of the system is shown in Fig. 2. Cross polarizers were utilized to reduce the effect of specular reflections, and images were acquired in a custom-built light enclosure to minimize ambient light. Raw images, after 2x2 pixel hardware binning, were 520 $\times$ 696 pixels, with a pixel size of 0.278 mm in the object space.

IV-B SFDI ground truth optical properties

Ground truth optical property maps were generated using conventional SFDI with 660 nm light following the method from Cuccia et al. [9]. First, images of a calibration phantom with homogeneous optical properties and the tissue of interest are captured under spatially modulated light. We used a flat polydimethylsiloxane-titanium dioxide (PDMS-TiO$_{2}$) phantom with a reduced scattering coefficient of 0.957 mm$^{-1}$ and an absorption coefficient of 0.0239 mm$^{-1}$ at 660 nm. We project spatial frequencies of 0 mm$^{-1}$ (DC) and 0.2 mm$^{-1}$ (AC), each at three different phase offsets ($0$, $\frac{2}{3}\pi$, and $\frac{4}{3}\pi$) for this study. AC images are demodulated at each pixel $x$ using:

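$$M_{AC}(x) = \frac{\sqrt{2}}{3} \cdot \sqrt{(I_{1}(x) - I_{2}(x))^{2} + (I_{2}(x) - I_{3}(x))^{2} + (I_{3}(x) - I_{1}(x))^{2}},$$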

where $I_{1}$ , $I_{2}$ , and $I_{3}$ represent images at the three phase offsets. The spatially varying DC amplitude is calculated as the average of the three DC images. Diffuse reflectance at each pixel $x$ is then computed as:

$$R_{d}(x) = \frac{M_{AC}(x)}{M_{AC,ref}(x)} \cdot R_{d,predicted}.$$

Here, $M_{AC,ref}$ denotes the demodulated AC amplitude of the reference phantom, and $R_{d,predicted}$ is the diffuse reflectance predicted by Monte Carlo models. Where indicated, we corrected for height and surface angle variation of each pixel from depth maps measured via profilometry. Profilometry measurements were obtained by projecting a spatial frequency of 0.15 mm$^{-1}$ and calculating depth at each pixel [34]. Finally, $\mu_{a}$ and $\mu_{s}^{\prime}$ are estimated by fitting $R_{d,0mm^{-1}}$ and $R_{d,0.2mm^{-1}}$ into an LUT previously created using Monte Carlo simulations [35].
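The three-phase demodulation and reference-phantom calibration described in this subsection can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the demodulation uses the standard three-phase form from Cuccia et al. [9], the function names are hypothetical, and the synthetic amplitudes and the value passed as `rd_ref_predicted` are made-up numbers chosen only to exercise the code. The LUT fitting step is not shown.

```python
import numpy as np


def demodulate_ac(i1, i2, i3):
    """AC amplitude from three images at phase offsets 0, 2*pi/3, 4*pi/3."""
    return (np.sqrt(2.0) / 3.0) * np.sqrt(
        (i1 - i2) ** 2 + (i2 - i3) ** 2 + (i3 - i1) ** 2
    )


def demodulate_dc(i1, i2, i3):
    """DC amplitude as the average of the three images."""
    return (i1 + i2 + i3) / 3.0


def diffuse_reflectance(m_sample, m_ref, rd_ref_predicted):
    """Calibrate against the reference phantom: R_d = M / M_ref * R_d,predicted."""
    return m_sample / m_ref * rd_ref_predicted


# Synthetic check (all numbers below are illustrative, not from the paper).
x_mm = np.arange(696) * 0.278            # pixel positions along one image row
fx = 0.2                                 # spatial frequency in mm^-1
phases = (0.0, 2 * np.pi / 3, 4 * np.pi / 3)
sample = [0.5 + 0.2 * np.cos(2 * np.pi * fx * x_mm + p) for p in phases]
reference = [0.6 + 0.4 * np.cos(2 * np.pi * fx * x_mm + p) for p in phases]

m_ac = demodulate_ac(*sample)            # modulation amplitude of the sample
m_ac_ref = demodulate_ac(*reference)     # modulation amplitude of the reference
rd_ac = diffuse_reflectance(m_ac, m_ac_ref, rd_ref_predicted=0.3)
```

Because the three phase offsets are evenly spaced, the demodulated amplitude is independent of where the sinusoid falls on each pixel, which is why the result is a flat map for this synthetic input.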

IV-C Single Snapshot Optical Properties (SSOP)

SSOP was implemented as the model-based alternative to GANPOP. This method separates DC and AC components from a single-phase structured-light image by frequency filtering with a 2D band-stop filter and a high-pass filter [17]. Both filters are rectangular windows that isolate the frequency range of interest while preserving high-frequency content of the image. In this study, cutoff frequencies $f_{DC}=$ [0.16 mm$^{-1}$, 0.24 mm$^{-1}$] and $f_{AC}=$ [0, 0.16 mm$^{-1}$] were selected [31]. $M_{DC}$ can subsequently be recovered through a 2D inverse Fourier transform, and the AC component is obtained using an additional Hilbert transform.
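To make the filtering steps concrete, the following is a simplified sketch of single-image demodulation. It is an assumption-laden illustration, not the SSOP implementation used in the paper: it filters each image row in 1D along the modulation direction instead of using 2D filters, it builds the analytic signal by hand in place of a library Hilbert transform, and the function name `ssop_rows` and the synthetic image are hypothetical. The cutoff values mirror those quoted above.

```python
import numpy as np


def ssop_rows(img, pixel_mm, stop_band=(0.16, 0.24), ac_cutoff=0.16):
    """Estimate M_DC and M_AC from one structured-light image, row by row."""
    n = img.shape[-1]
    freqs = np.fft.fftfreq(n, d=pixel_mm)          # cycles per mm
    abs_f = np.abs(freqs)
    spectrum = np.fft.fft(img, axis=-1)

    # DC: a band-stop filter removes the projected pattern and keeps the rest.
    keep_dc = ~((abs_f >= stop_band[0]) & (abs_f <= stop_band[1]))
    m_dc = np.fft.ifft(spectrum * keep_dc, axis=-1).real

    # AC: high-pass filter, then take the envelope of the analytic signal
    # (the Hilbert-transform step), built by zeroing negative frequencies
    # and doubling positive ones.
    keep_ac = abs_f > ac_cutoff
    analytic_gain = np.where(freqs > 0, 2.0, 0.0)
    analytic = np.fft.ifft(spectrum * keep_ac * analytic_gain, axis=-1)
    m_ac = np.abs(analytic)
    return m_dc, m_ac


# Synthetic image: the pixel size is chosen (not taken from the paper) so that
# the 0.2 mm^-1 pattern spans a whole number of periods and avoids leakage.
pixel_mm = 0.25
x_mm = np.arange(640) * pixel_mm
row = 0.5 + 0.2 * np.cos(2 * np.pi * 0.2 * x_mm)
image = np.tile(row, (4, 1))
m_dc, m_ac = ssop_rows(image, pixel_mm)
```

On real tissue the sample's own spatial content overlaps the filtered bands, which is the source of the filtering artifacts discussed for SSOP elsewhere in the paper.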

IV-D GANPOP Architecture

The GANPOP architecture is based on an adversarial training framework. When used in a conditional GAN-based image-to-image translation setup, this framework has the ability to learn a loss function while avoiding the uncertainty inherent in using hand-crafted loss functions [23, 36]. The generator is tasked with predicting pixel-wise optical properties from SFDI images while the discriminator classifies pairs of SFDI images and optical property maps as being real or fake (Fig. 1). The discriminator additionally gives feedback to the generator over the course of training.

The generator employs a modified U-Net consisting of an encoder and a decoder with skip connections [37]. However, unlike the original U-Net, the GANPOP network includes properties of a ResNet, including short skip connections within each level [38] (Fig. 3). Each residual block is a 3-layer building block with an additional convolutional layer on both sides. This ensures that the number of input features matches that of the residual block and that the network is symmetric [39]. Moreover, the GANPOP generator replaces the U-Net concatenation step with feature addition, making it a fully residual network. Using $n$ as the total number of layers in the encoder-decoder network and $i$ as the current layer, long skip connections are added between the $i^{th}$ and the $(n-i)^{th}$ layer in order to sum features from the two levels. After the last layer in the decoder, a final convolution is applied to shrink the number of output channels and is followed by a $Tanh$ function. Regular ReLUs are used for the decoder and leaky ReLUs (slope = 0.2) for the encoder.

We chose a receptive field of 70 $\times$ 70 pixels for our discriminator because this window captures two periods of AC illumination in each direction. This discriminator is a three-layer classifier with leaky ReLUs (slope = 0.2), as discussed in [23]. The discriminator makes classification decisions based on the current batch as well as a batch randomly sampled from 64 previously generated image pairs. Both networks are trained iteratively and the training process is stabilized by incorporating spectral normalization in both the generator and the discriminator [40]. The conditional GAN objective for generating optical property maps from input images ( $G:X\rightarrow Y$ ) is:

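$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x, y \sim p_{data}(x, y)}[(D(x, y) - 1)^{2}] + \mathbb{E}_{x \sim p_{data}(x)}[D(x, G(x))^{2}],$$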

where $G$ is the generator, $D$ the discriminator, and $p_{\text{data}}$ is the optimal distribution of the data. We empirically found that a least squares GAN (LSGAN) objective [41] produced slightly better performance in predicting optical properties than a traditional GAN objective [21], and so we utilize LSGAN in the networks presented here. An additional $\mathcal{L}_{1}$ loss term was added to the GAN loss to further minimize the distance from the ground truth distribution and stabilize adversarial training:

$$\mathcal{L}_{1}(G) = \mathbb{E}_{x, y \sim p_{data}(x, y)}[\lVert y - G(x) \rVert_{1}].$$

The full objective can be expressed as:

$$\mathcal{L}(G, D) = \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}_{1}(G),$$

where $\lambda$ is the regularization parameter of the $\mathcal{L}_{1}$ loss term. This optimization problem was solved using an Adam solver with a batch size of 1 [42]. The training code was implemented using PyTorch 1.0 on Ubuntu 16.04 on Google Cloud. For all experiments, $\lambda$ was set to 60. A total of 200 epochs was used with a learning rate of 0.0002 for the first half of the epochs, and the learning rate was linearly decayed over the remaining half. Both networks were initialized from a Gaussian distribution with a mean and standard deviation of 0 and 0.02, respectively.
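As a rough numerical illustration of the objective and schedule just described, the sketch below evaluates the least squares GAN terms, the $\mathcal{L}_{1}$ term weighted by $\lambda = 60$, and a two-phase learning rate. Several details are assumptions rather than facts from the paper: the function names are hypothetical, the $\mathcal{L}_{1}$ term is averaged per pixel, the generator update pushes $D(x, G(x))$ toward 1 as is common in LSGAN code, and the linear decay is assumed to head toward zero at the final epoch. NumPy arrays stand in for discriminator outputs; no network is trained here.

```python
import numpy as np

LAMBDA_L1 = 60.0  # regularization weight reported in the text


def discriminator_loss(d_real, d_fake):
    """LSGAN terms: real pairs are pushed toward 1, generated pairs toward 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)


def l1_loss(y, g_x):
    """Mean absolute difference between ground truth y and prediction G(x)."""
    return np.mean(np.abs(y - g_x))


def generator_loss(d_fake, y, g_x, lam=LAMBDA_L1):
    """Adversarial term (assumed target of 1 for generated pairs) plus L1."""
    return np.mean((d_fake - 1.0) ** 2) + lam * l1_loss(y, g_x)


def learning_rate(epoch, base_lr=2e-4, total_epochs=200):
    """Constant for the first half, then linear decay (assumed toward zero)."""
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    return base_lr * (total_epochs - epoch) / half


# Toy values: a perfect discriminator and a prediction that is off by 0.01.
y_true = np.array([0.10, 0.20, 0.30])
y_pred = y_true + 0.01
loss_d = discriminator_loss(np.ones(3), np.zeros(3))
loss_g = generator_loss(np.ones(3), y_true, y_pred)
```

With $\lambda = 60$, a mean per-pixel error of 0.01 in the normalized maps contributes 0.6 to the generator loss, which gives a sense of how strongly the $\mathcal{L}_{1}$ term is weighted relative to the adversarial term.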

Conventional neural networks typically operate on three-channel (or RGB) images as input and output, with each channel representing red (R), green (G), or blue (B). In this study, four separate networks ( $\text{N}_{1}$ to $\text{N}_{4}$ ) were trained for image-to-image translation with a variety of input and output parameters, summarized in Table I.

For input, $I_{AC}$ and $I_{DC}$ represent single-phase raw images at spatial frequencies of 0.2 mm$^{-1}$ and 0 mm$^{-1}$, respectively. $M_{DC,ref}$ and $M_{AC,ref}$ are the demodulated DC and AC amplitude of the calibration phantom. Blue channels are left as zeros in all cases. It is important to note that $M_{AC,ref}$ and $M_{DC,ref}$ are measured only once during calibration before the imaging experiment and thus do not add to the total acquisition time. The purpose of these two terms is to account for drift of the system over time and correct for non-uniform illumination, making the patch used in the network origin-independent. These two calibration images are also required by traditional SFDI and the SSOP approaches. A network without calibration was empirically trained, and it produced 230% and 58% larger error than with calibration in absorption and scattering coefficients, respectively. A single output image contains both $\mu_{a}$ and $\mu_{s}^{\prime}$ in different channels. Two dedicated networks were empirically trained for estimating $\mu_{a}$ and $\mu_{s}^{\prime}$ independently, but no accuracy benefits were observed. Optical property maps calculated by non-profile-corrected SFDI were used as ground truth for $N_{1}$ and $N_{3}$ . We also assessed the ability of GANPOP to learn both optical property estimation and sample height and surface normal correction by training and testing with profilometry-corrected data ( $N_{2}$ and $N_{4}$ ).
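The channel layout in Table I can be illustrated with a short sketch that assembles a three-channel network input from one raw image and the two calibration maps. The function name `build_input` and the small `eps` guard against division by zero are assumptions added for illustration; the paper does not describe how zero-valued reference pixels are handled.

```python
import numpy as np


def build_input(raw, m_dc_ref, m_ac_ref, eps=1e-8):
    """Stack R = raw / M_DC,ref, G = raw / M_AC,ref, B = 0 into an HxWx3 array.

    Pass a single-phase AC image for N1/N2 or a DC image for N3/N4.
    eps is a hypothetical safeguard, not part of the published method.
    """
    red = raw / (m_dc_ref + eps)
    green = raw / (m_ac_ref + eps)
    blue = np.zeros_like(red)
    return np.stack([red, green, blue], axis=-1)


# Illustrative values only: a uniform raw image and uniform reference maps.
raw = np.full((520, 696), 0.3)
m_dc_ref = np.full((520, 696), 0.6)
m_ac_ref = np.full((520, 696), 0.4)
net_input = build_input(raw, m_dc_ref, m_ac_ref)
```

Dividing by the reference amplitudes is what removes the illumination non-uniformity, so a patch cropped from any location presents comparable values to the network.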

All optical property maps for training and testing were normalized to have a consistent representation in the 8-bit images commonly used in CNNs [43]. We defined the maximum value of 255 to be 0.25 mm$^{-1}$ for $\mu_{a}$ and 2.5 mm$^{-1}$ for $\mu_{s}^{\prime}$ . Additionally, each image of size 520 $\times$ 696 was segmented at a random stride size into multiple patches of 256 $\times$ 256 pixels and paired with a registered optical property patch for training, as shown in Fig. 4.
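A minimal sketch of the 8-bit encoding and patch pairing described above is given below. The scale limits come from the text, and the channel order follows Table I ($\mu_{a}$ in R, $\mu_{s}^{\prime}$ in G). The rest is assumed for illustration: the function names are hypothetical, values above the limits are clipped, and patches are drawn at random top-left positions as a stand-in for the random-stride segmentation, whose exact procedure is not specified here.

```python
import numpy as np

MU_A_MAX = 0.25   # mm^-1 mapped to pixel value 255
MU_S_MAX = 2.5    # mm^-1 mapped to pixel value 255


def encode_maps(mu_a, mu_s):
    """Pack mu_a (R) and mu_s' (G) into an 8-bit HxWx3 image, blue left at 0."""
    red = np.round(np.clip(mu_a / MU_A_MAX, 0.0, 1.0) * 255).astype(np.uint8)
    green = np.round(np.clip(mu_s / MU_S_MAX, 0.0, 1.0) * 255).astype(np.uint8)
    return np.stack([red, green, np.zeros_like(red)], axis=-1)


def decode_maps(img8):
    """Invert encode_maps, up to 8-bit quantization."""
    mu_a = img8[..., 0].astype(float) / 255 * MU_A_MAX
    mu_s = img8[..., 1].astype(float) / 255 * MU_S_MAX
    return mu_a, mu_s


def random_patch_pairs(image, target, patch=256, count=4, seed=None):
    """Crop registered (input, target) patch pairs at random positions."""
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    pairs = []
    for _ in range(count):
        top = int(rng.integers(0, height - patch + 1))
        left = int(rng.integers(0, width - patch + 1))
        window = (slice(top, top + patch), slice(left, left + patch))
        pairs.append((image[window], target[window]))
    return pairs


# Illustrative maps with uniform optical properties.
target = encode_maps(np.full((520, 696), 0.1), np.full((520, 696), 1.0))
inputs = np.zeros((520, 696, 3))
pairs = random_patch_pairs(inputs, target, seed=0)
mu_a_back, mu_s_back = decode_maps(target)
```

With these limits, one gray level corresponds to roughly 0.001 mm$^{-1}$ in $\mu_{a}$ and 0.01 mm$^{-1}$ in $\mu_{s}^{\prime}$, which bounds the precision of any map stored this way.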

IV-E Tissue Samples

IV-E1 Ex vivo human esophagus

Eight ex vivo human esophagectomy samples were imaged at Johns Hopkins Hospital for training and testing of our networks. All patients were diagnosed with esophageal adenocarcinoma and were scheduled for an esophagectomy. The research protocol was approved by the Johns Hopkins Institutional Review Board and consents were acquired from all patients prior to each study. All samples were handled by a trained pathologist and imaged within one hour after resection [44].

Example raw images of a specimen captured by the SFDI system are shown in Fig. 2, 4, and 11(a). All samples consisted of the distal esophagus, the gastroesophageal junction, and the proximal stomach. The samples contain complex topography and a relatively wide range of optical properties (0.02-0.15 mm$^{-1}$ for $\mu_{a}$ and 0.1-1.5 mm$^{-1}$ for $\mu_{s}^{\prime}$ at $\lambda$ = 660 nm), making them suitable for training a generalizable model that can be applied to other tissues with non-uniform surface profiles. An illumination wavelength of 660 nm was chosen because it is close to the optimal wavelength for accurate tissue oxygenation measurements [45].

In this study, six ex vivo human esophagus samples were used for training of the GANPOP model and two used for testing. A leave-two-out cross validation method was implemented, resulting in four iterations of training for each network. Performance results reported here are from an average of these four iterations.
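The leave-two-out scheme can be sketched as a simple fold generator. How the eight specimens were actually grouped into test pairs is not stated, so pairing consecutive samples is an assumption, and the sample identifiers and function name are hypothetical.

```python
def leave_two_out_folds(sample_ids):
    """Split samples into folds that each hold out two consecutive samples."""
    folds = []
    for start in range(0, len(sample_ids), 2):
        test = sample_ids[start:start + 2]
        train = [s for s in sample_ids if s not in test]
        folds.append((train, test))
    return folds


# Eight hypothetical specimen labels give four folds of six training and
# two testing samples, matching the counts described in the text.
specimens = [f"esophagus_{k}" for k in range(1, 9)]
folds = leave_two_out_folds(specimens)
```

Each specimen appears in exactly one test set, so averaging the four iterations covers all eight samples once.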

IV-E2 Homogeneous phantoms

The four GANPOP networks were also trained on a set of tissue-mimicking silicone phantoms made from PDMS-TiO2 (P4, Eager Plastics Inc.) mixed with India ink as absorbing agent [46]. To ensure homogeneous optical properties, the mixture was thoroughly combined and poured into a flat mold for curing. In total, 18 phantoms with unique combinations of $\mu_{a}$ and $\mu_{s}^{\prime}$ were fabricated, and their optical properties are summarized in Fig. 5.

In this study, six tissue-mimicking phantoms were used for training and twelve for testing. We intentionally selected phantoms for training that had optical properties not spanned by the esophagus training samples (highlighted by green ellipses in Fig. 6), in order to develop GANPOP networks capable of estimating the widest range of optical properties.

IV-E3 In vivo samples

To provide the network with in vivo samples that were perfused and oxygenated, seven human hands with different levels of pigmentation (Fitzpatrick skin types 1-6) were imaged with SFDI. Two were used for training and five for testing.

IV-E4 Swine tissue

Four specimens of upper gastrointestinal tracts that included stomach and esophagus were harvested from four different pigs for ex vivo imaging with SFDI. Optical properties of these samples are summarized in Fig. 7. Additionally, we imaged a pig colon in vivo during a surgery. The live study was performed with approval from Johns Hopkins University Animal Care and Use Committee (ACUC). All swine tissue images were used exclusively for testing optical property prediction.

IV-F Performance Metric

Normalized Mean Absolute Error (NMAE) was used to evaluate the performance of different methods, which was calculated using:

$$NMAE = \frac{\sum_{i=1}^{T} |p_{i} - p_{i,ref}|}{\sum_{i=1}^{T} p_{i,ref}}.$$

Here, $p_{i}$ and $p_{i,ref}$ are pixel values of predicted and ground-truth data, and $T$ is the total number of pixels. The metric was calculated using SFDI output as ground truth. A smaller NMAE value indicates better performance.
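The metric is straightforward to compute; a small sketch follows. The optional `mask` argument is an assumption added for convenience, reflecting the manual background masking reported for the pig samples, and the function name and example values are hypothetical.

```python
import numpy as np


def nmae(pred, ref, mask=None):
    """Normalized mean absolute error: sum|pred - ref| / sum(ref)."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if mask is not None:
        # Keep only the pixels flagged as tissue (hypothetical boolean mask).
        pred, ref = pred[mask], ref[mask]
    return np.abs(pred - ref).sum() / ref.sum()


# Two-pixel example: errors of 0.1 each against references of 1 and 2.
score = nmae([1.1, 1.9], [1.0, 2.0])
masked = nmae([1.1, 9.0], [1.0, 2.0], mask=np.array([True, False]))
```

Because the denominator sums the reference values, pixels with larger optical properties carry more weight than in a per-pixel percentage error.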

V Results

V-A SSOP validation

For benchmarking, SSOP was implemented as a model-based counterpart of GANPOP. For independent validation, we applied SSOP to 18 homogeneous tissue phantoms (Fig. 5). Each value was calculated as the mean of a 100 $\times$ 100-pixel region of interest (ROI) from the center of the phantom, with error bars showing standard deviations. SSOP demonstrates high accuracy in predicting optical properties of the phantoms, with an average percentage error of 2.35% for absorption and 2.69% for reduced scattering.

V-B GANPOP test in homogeneous phantoms

Phantom optical properties predicted by $N_{1}$ are plotted with ground truth in Fig. 6. Each optical property reported is the average value of a 100 $\times$ 100 ROI of a homogeneous phantom, with error bars showing standard deviations. On average, GANPOP produced 3.06% error for absorption and 1.26% for scattering. The scatter plot in Fig. 6 is overlaid on a 2D histogram of pixel counts for each ( $\mu_{a}$ , $\mu_{s}^{\prime}$ ) pair used in an example training iteration. Green ellipses indicate training samples from homogeneous phantoms. The three testing results enclosed by red boxes have optical properties outside of the range spanned by the training data but were still reasonably estimated by the GANPOP network.

V-C GANPOP test on ex vivo human esophagus

GANPOP and SSOP were tested on the ex vivo human esophagus samples. NMAE scores were calculated for the two testing samples from each of four-fold cross validation iterations, and the average values from the four networks tested on a total of eight samples are reported in Fig. 8. Results from $N_{2}$ , $N_{4}$ , and SSOP are also compared to profilometry-corrected ground truth and shown in the same bar chart. On average, GANPOP produced approximately 58% higher accuracy with AC input than SSOP. Example optical property maps of a testing sample generated by $N_{1}$ are shown in Fig. 11(a).

V-D GANPOP test on ex vivo pig samples

Each of the four GANPOP networks was tested on ex vivo esophagus and stomach samples from four pigs. Average NMAE scores for the GANPOP and SSOP methods were calculated for all eight pig tissue specimens (four esophagi and four stomachs) and are summarized in Fig. 9. Background regions, which were absorbing paper, were manually masked in the calculation, and the reported scores are the average values of 779,101 tissue pixels. The optical properties of the pig samples are also shown in a 2D histogram in Fig. 7. Despite the fact that some testing samples had optical properties not covered by the training set, GANPOP outperforms SSOP in terms of average accuracy and qualitative image quality (Fig. 11).

V-E GANPOP test on in vivo pig colon

The networks were additionally tested on an in vivo pig colon. Average NMAE scores for GANPOP and SSOP are reported in Fig. 10 as average values of 118,594 pixels. The generated maps are shown in Fig. 11(c). The proposed technique produces more accurate results than SSOP when compared to both uncorrected and profile-corrected ground truth data.

V-F Comparative analysis of existing deep networks

Several deep learning architectures were explored for the purpose of optical property mapping, including conventional U-Net [37] and ResNet [38], both stand-alone and integrated in a cGAN framework [23, 39]. The NMAE performance of each architecture was compared to GANPOP. All the networks were four-fold cross validated, and the testing dataset included eight ex vivo human esophagi, four ex vivo pig GI samples, one in vivo pig colon, and five in vivo hands (Table II).

VI Discussion

In this study, we have described a GAN-based technique for end-to-end optical property mapping from single structured and flat-field illumination images. Compared to the original pix2pix paradigm [23], the generator of our model adopted a fusion of U-Net and ResNet architectures for several reasons. First, a fully residual network effectively resolved the issue of vanishing gradients, allowing us to stably train a relatively deep neural network [39]. Second, the use of both long and short skip connections enables the network to learn from the structure of the images while preserving both low and high frequency details. The information flow both within and between levels is important for the prediction of optical properties, as demonstrated by the improved performance over a U-Net or ResNet approach. Moreover, as shown in Table II, the inclusion of a discriminator significantly improved the performance of the fusion generator. This was especially apparent in the case of the pig data, likely due to this testing tissue differing considerably from the training samples. We hypothesize that the cGAN architecture enforced the similarity between generated images and ground truth while preventing the generator from depending too much on the context of the image. Overall, the GANPOP method outperformed the other deep networks by a significant margin on all data types (Table II). Additionally, we empirically found that a least squares GAN outperformed a conventional GAN when trained for 200 epochs. However, as discussed in [47], this improvement could potentially be matched by a conventional GAN with more training.

Compared to phantom ground truth in Fig. 6, GANPOP estimated optical properties with standard deviations on the same order of magnitude as conventional SFDI. Additionally, the GANPOP networks exhibited potential to extrapolate phantom optical properties that were not present in the training samples (highlighted by the red boxes in Fig. 6). This provides evidence that these networks have successfully learned the relationship between diffuse reflectance and optical properties, and are able to infer beyond the range of training data.

Figs. 8, 9, and 10 show that GANPOP with AC input consistently outperformed SSOP when tested on these types of data. From Fig. 7, it is evident that the optical properties of the pig samples differed considerably from those of the human esophagi used for training. Nevertheless, GANPOP exhibited more accurate estimation than the model-based SSOP benchmark. Moreover, a single network was trained to estimate both $\mu_{a}$ and $\mu_{s}^{\prime}$ because of its lower computational cost and its potential benefits in learning the relationships between the two parameters in tissues.

Compared to SSOP, GANPOP optical property maps contain fewer artifacts caused by frequency filtering (Fig. 11). For both GANPOP and SSOP optical property estimation, a relatively large error is present on the edge of the sample. This is caused by the transition between tissue and the background, which poses problems for SFDI ground truth, and would be less significant for in vivo imaging. Artifacts caused by patched input are visible in GANPOP images, which can be reduced by using a larger patch size. However, this was not implemented in our study due to the size and the number of the specimens available for training. In our benchmarking with SSOP, we implemented the first version of the technique, which does not correct for sample height and surface angle variations. Recent developments have enabled these corrections by utilizing a more complex illumination pattern and additional processing steps [31]. We implemented the original version of SSOP because it allowed comparing identical input images for both SSOP and GANPOP.

In addition to training GANPOP models to estimate optical properties from objects assumed to be flat ($N_{1}$ and $N_{3}$), we also trained networks that directly estimate profilometry-corrected optical properties ($N_{2}$ and $N_{4}$). For the same AC input, these models generated improved results over SSOP when tested on human and pig data. Moreover, when compared against profile-corrected ground truth, they produced 35.7% less error for $\mu_{a}$ and 44.7% less error for $\mu_{s}^{\prime}$ than did the uncorrected GANPOP results from $N_{1}$ and $N_{3}$. This means that GANPOP is capable of inferring surface profile from a single fringe image and adjusting measured diffuse reflectance accordingly. In experiments $N_{3}$ and $N_{4}$, when trained on DC illumination images, the GANPOP model became less accurate. Nevertheless, these networks converged during training, and, although less accurate, the ex vivo human results still produced a lower NMAE than SSOP. Hence, given a sufficiently large training dataset, GANPOP has the potential to enable rapid and accurate wide-field measurements of optical properties from conventional camera systems. This could be useful for applications such as endoscopic imaging of the GI tract, where the range of tissue optical properties is limited and modification of the hardware system is challenging.

In terms of speed, GANPOP requires capturing one sample image instead of six, thus significantly shortening data acquisition time. For optical property extraction, the model developed here without optimization takes approximately 0.04 s to process a 256 $\times$ 256 image on an NVIDIA Tesla P100 GPU. Therefore, this technique has the potential to be applied in real time for fast and accurate optical property mapping. In terms of adaptability, random cropping ensures that our trained models work on any 256 $\times$ 256 patches within the field of view. Additionally, while the models were trained on the same calibration phantom at 660 nm, they could theoretically be applied to other references or wavelengths by scaling the average $M_{DC,ref}$ and $M_{AC,ref}$ .
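The two quantitative points in this paragraph, random cropping to the patch size and the per-patch processing time, can be sketched as follows. This is an illustrative example using plain Python lists rather than the authors' pipeline; the frame size, helper names, and seeded generator are hypothetical.

```python
import random

PATCH = 256  # patch size stated in the text


def random_crop(image, patch=PATCH, rng=random):
    """Return a patch x patch sub-image at a uniformly random location."""
    height, width = len(image), len(image[0])
    if height < patch or width < patch:
        raise ValueError("image is smaller than the patch size")
    top = rng.randint(0, height - patch)   # randint includes both endpoints
    left = rng.randint(0, width - patch)
    return [row[left:left + patch] for row in image[top:top + patch]]


def patches_per_second(seconds_per_patch=0.04):
    """Throughput implied by the reported per-patch processing time."""
    return 1.0 / seconds_per_patch


# Hypothetical 300 x 400 frame filled with a dummy reflectance value.
frame = [[0.0] * 400 for _ in range(300)]
crop = random_crop(frame, rng=random.Random(0))
print(len(crop), len(crop[0]))
print(patches_per_second(), "patches per second")
```

At roughly 0.04 s per patch, the reported timing corresponds to about 25 patches per second, before any acquisition or data-transfer overhead is counted.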

For future work, a more generalizable model could be trained on a wider range of optical properties and imaging geometries, though this would inevitably incur a higher computational cost and necessitate a much larger dataset for training. For example, all input images used here were acquired at an approximately constant working distance. Incorporating monocular depth estimates into the prediction may enable GANPOP to account for large differences in working distance [48, 49]. This could be particularly useful for endoscopic screening, where constant imaging geometries are difficult to achieve. With a model trained on images at multiple wavelengths, this technique could be modified to provide critical information in real time, such as tissue oxygenation and metabolism biomarkers. Accuracy in this application may also benefit from training adversarial networks to directly estimate these biomarkers rather than using optical properties as intermediate representations. By similar extension, future research may develop networks to directly estimate disease diagnosis and localization from structured light images.

VII Conclusion

We have proposed a deep learning-based approach to optical property mapping (GANPOP) from single snapshot wide-field images. This model utilizes a conditional Generative Adversarial Network consisting of a generator and a discriminator that are iteratively trained in concert with one another. Using SFDI-determined optical properties as ground truth, GANPOP produces significantly more accurate optical property maps than a model-based SSOP benchmark. Importantly, we have demonstrated that GANPOP can estimate optical properties with conventional flat-field illumination, potentially enabling optical property mapping in endoscopy without modifications for structured illumination. This method lays the foundation for future work in incorporating real-time, high-fidelity optical property mapping and quantitative biomarker imaging into endoscopy and image-guided surgery applications.

Acknowledgment

This work was supported in part with funding from the NIH Trailblazer Award (R21 EB024700).

We would like to thank Dr. Darren Roblyer’s group at Boston University for sharing SFDI software.

Bibliography

[1] R. Richards-Kortum and E. Sevick-Muraca, “Quantitative optical spectroscopy for tissue diagnosis,” Annual Review of Physical Chemistry, vol. 47, no. 1, pp. 555–606, 1996.
[2] R. A. Drezek, M. Guillaud, T. G. Collier, I. Boiko, A. Malpica, C. E. MacAulay, M. Follen, and R. R. Richards-Kortum, “Light scattering from cervical cells throughout neoplastic progression: influence of nuclear morphology, DNA content, and chromatin texture,” Journal of Biomedical Optics, vol. 8, no. 1, pp. 7–17, 2003.
[3] B. W. Maloney, D. M. McClatchy, B. W. Pogue, K. D. Paulsen, W. A. Wells, and R. J. Barth, “Review of methods for intraoperative margin detection for breast conserving surgery,” Journal of Biomedical Optics, vol. 23, no. 10, p. 100901, 2018.
[4] J. R. Mourant, M. Canpolat, C. Brocker, O. Esponda-Ramos, T. M. Johnson, A. Matanock, K. Stetter, and J. P. Freyer, “Light scattering from cells: the contribution of the nucleus and the effects of proliferative status,” Journal of Biomedical Optics, vol. 5, no. 2, pp. 131–138, 2000.
[5] Z. A. Steelman, D. S. Ho, K. K. Chu, and A. Wax, “Light-scattering methods for tissue diagnosis,” Optica, vol. 6, no. 4, pp. 479–489, Apr. 2019.
[6] A. J. Lin, M. A. Koike, K. N. Green, J. G. Kim, A. Mazhar, T. B. Rice, F. M. LaFerla, and B. J. Tromberg, “Spatial frequency domain imaging of intrinsic optical property contrast in a mouse model of Alzheimer’s disease,” Annals of Biomedical Engineering, vol. 39, no. 4, pp. 1349–1357, Apr. 2011.
[7] N. Shah, A. Cerussi, C. Eker, J. Espinoza, J. Butler, J. Fishkin, R. Hornung, and B. Tromberg, “Noninvasive functional optical spectroscopy of human breast tissue,” Proceedings of the National Academy of Sciences, vol. 98, no. 8, pp. 4420–4425, 2001.
[8] N. Dögnitz and G. Wagnières, “Determination of tissue optical properties by steady-state spatial frequency-domain reflectometry,” Lasers in Medical Science, vol. 13, no. 1, pp. 55–65, 1998.