Efficient Informed Proposals for Discrete Distributions via Newton's Series Approximation
Yue Xiang, Dongyao Zhu, Bowen Lei, Dongkuan Xu, Ruqi Zhang

TL;DR
This paper introduces a novel Newton's series-based approach for creating efficient, gradient-like proposals for discrete distributions, enabling faster convergence in MCMC without requiring differentiable target distributions.
Contribution
The authors develop a general, gradient-free proposal method for discrete distributions using Newton's series expansion, improving exploration and convergence in MCMC algorithms.
Findings
Outperforms existing methods in various experiments
Provides guaranteed convergence rates
Effective in diverse applications like text summarization and image retrieval
Abstract
Gradients have been exploited in proposal distributions to accelerate the convergence of Markov chain Monte Carlo algorithms on discrete distributions. However, these methods require a natural differentiable extension of the target discrete distribution, which often does not exist or does not provide effective gradient guidance. In this paper, we develop a gradient-like proposal for any discrete distribution without this strong requirement. Built upon a locally-balanced proposal, our method efficiently approximates the discrete likelihood ratio via Newton's series expansion to enable a large and efficient exploration in discrete spaces. We show that our method can also be viewed as a multilinear extension, thus inheriting its desired properties. We prove that our method has a guaranteed convergence rate with or without the Metropolis-Hastings step. Furthermore, our method outperforms a…
| Natural differentiable extension available | Natural differentiable extension unavailable | |||||||
|
|
|
||||||
|
|
|
| Method | R | F | P | ESS | Runtime | |||
| MANA | 6.28 | 6.40 | 6.46 | 8.72 | 8.85 | 10.96 | 57.2 | 6.7 |
| LB | 6.28 | 6.39 | 6.42 | 7.91 | 8.40 | 10.68 | 41.8 | 9.9 |
| Gibbs | 6.01 | 6.12 | 6.42 | 7.91 | 8.31 | 10.62 | 15.2 | 1.0 |
| Method | ) | mAP | ||
| MANA | 11.36 | 11.37 | 11.38 | 0.84 |
| LB | 11.04 | 11.05 | 11.05 | 0.55 |
| Gibbs | 11.01 | 11.01 | 11.03 | 0.53 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Bayesian Methods and Mixture Models
**Efficient Informed Proposals for Discrete Distributions
via Newton’s Series Approximation**
Yue Xiang
Dongyao Zhu
Bowen Lei
Renmin University of China
Independent Researcher
Texas A&M University
Dongkuan Xu
Ruqi Zhang
North Carolina State University
Purdue University
Abstract
Gradients have been exploited in proposal distributions to accelerate the convergence of Markov chain Monte Carlo algorithms on discrete distributions. However, these methods require a natural differentiable extension of the target discrete distribution, which often does not exist or does not provide effective gradient guidance. In this paper, we develop a gradient-like proposal for any discrete distribution without this strong requirement. Built upon a locally-balanced proposal, our method efficiently approximates the discrete likelihood ratio via Newton’s series expansion to enable a large and efficient exploration in discrete spaces. We show that our method can also be viewed as a multilinear extension, thus inheriting its desired properties. We prove that our method has a guaranteed convergence rate with or without the Metropolis-Hastings step. Furthermore, our method outperforms a number of popular alternatives in several different experiments, including the facility location problem, extractive text summarization, and image retrieval.
1 INTRODUCTION
Discrete structures are common in the real world, from discrete data such as text [Wang and Cho, 2019, Gu et al., 2017] and genomes [Wang et al., 2010], to discrete models such as low-precision neural networks [Courbariaux et al., 2016, Peters and Welling, 2018] and graphical models of molecules [Gilmer et al., 2017]. As data and models become complex and large-scale, it is desirable to develop efficient proposals in Markov chain Monte Carlo (MCMC) algorithms that allow us to sample from these complex high-dimensional discrete distributions [Zhang et al., 2022a].
Gradients have been widely utilized in proposal distributions to accelerate the convergence of MCMC, such as the Langevin algorithm [Roberts and Tweedie, 1996, Roberts and Stramer, 2002] and Hamiltonian Monte Carlo (HMC) [Duane et al., 1987, Neal et al., 2011]. These gradient-based methods are mainly designed for continuous distributions and require a natural neighborhood to define gradients. However, since there is no natural neighborhood in discrete distributions, it becomes challenging to incorporate gradients in the proposal to accelerate sampling.
Previous research has been devoted to making gradient-based proposals for efficient discrete sampling, however, they either require natural differentiable relaxation or sacrifice convergence speed. As shown in the “Natural continuous extension available” column in Table 1, Gibbs with gradient proposal [Grathwohl et al., 2021] and discrete Langevin proposal (DLP) [Zhang et al., 2022a] assume the existence of an underlying differentiable distribution in the discrete space and exploit gradient information to speed up sampling and inference. In the top right of Table 1, the locally-balanced proposal [Zanella, 2020] does not require this assumption, while it only conducts local moves in small windows, which leads to slow convergence, especially in high-dimensional tasks. Therefore, the question we need to answer is how to design a method that can sample efficiently in discrete spaces and has no strong requirement of natural continuous expansion.
To answer this question, we present a gradient-like informed proposal to efficiently sample from any discrete distribution without the strong requirement of continuous relaxations. To find more informative directions during sampling, our method estimates the likelihood ratio of making discrete moves through Newton’s series expansion, so we call our proposal Newton proposal. In addition, we design a coordinatewise factorization scheme in our method, so we can update multiple coordinates in a single move, which further improves the sampling efficiency. We summarize our contributions as follows:
- •
We propose a new informed proposal, Newton proposal, for discrete distributions. It allows multiple coordinates to be updated simultaneously while not requiring that the discrete distribution to be naturally extended to the continuous domain.
- •
We show that our Newton proposal can be obtained from Newton’s series approximation to the target discrete distribution or from Taylor expansion to the multilinear extension of the discrete distribution, which justifies Newton proposal’s desirable properties.
- •
We theoretically prove the convergence rate of our Newton scheme without and with the Metropolis-Hastings correction, demonstrating its efficient sampling in discrete distributions.
- •
We experimentally show that Newton proposal outperforms existing discrete proposals and some optimization-based methods when sampling from high-dimensional complex discrete distributions, including facility location, text summarization, and image retrieval.
2 RELATED WORK
Informed Proposal Various informed Metropolis-Hastings (MH) proposal distributions have been designed to avoid slow mixing and slow convergence brought by random walk MH proposals. Using symmetric proposal distributions, random walk MH schemes are easy to implement, but no information about the target distribution is utilized and the new state is proposed randomly. In the contrary, informed proposal distributions elaborate information about the target distribution, such as the gradient of the target to bias the proposal distribution towards high probability, resulting in substantial improvements of MCMC performances. However, most of these informed proposals are based on derivatives and it is nontrivial to extend such methods to discrete spaces. As a consequence, most MCMC proposals for discrete spaces often rely on symmetric and uninformed proposal distributions, which can induce slow convergence.
Continuous Relaxation-Based Method Gradient-based informed proposals can be applied to discrete distributions via continuous relaxations [Pakman and Paninski, 2013, Nishimura et al., 2020, Han et al., 2020, Zhou, 2020, Jaini et al., 2021, Zhang et al., 2022b]. They are usually implemented by transporting the problem into a continuous domain, performing updates under gradient-based proposals there, and transforming back after sampling. The efficiency of this kind of continuous relaxation highly depends on the properties of the relaxed continuous distributions which may be arbitrarily difficult to sample from, such as being highly multi-modal. To avoid these pitfalls, for discrete distributions which can be displayed as continuous, differentiable functions accepting real-valued inputs but are evaluated only on a discrete subset of their domain, Gibbs-with-gradient proposal [Grathwohl et al., 2021] and discrete Langevin proposal [Zhang et al., 2022a] use gradients to inform discrete updates directly for these discrete distributions rather than transport the discrete domain to a continuous one. However, most discrete distributions in the real world do not have a natural continuous extension, or the natural extension is still not differentiable. This is why we propose Newton proposal.
Locally-Balanced Proposal Based on local neighborhood information at the current location, the locally-balanced proposal [Zanella, 2020] is an informed framework that is applicable to both discrete and continuous spaces. When sampling from discrete distributions, it does not require natural differentiable extensions. Locally-balanced proposals have been extended to continuous-time Markov processes [Power and Goldman, 2019] and have been tuned via mutual information [Sansone, 2022]. It has also been used in Multiple-try Metropolis (MTM) algorithms to achieve fast convergence [Gagnon et al., 2022]. It is very expensive to construct locally-balanced proposals when the local neighborhood is large or the dimension is high, preventing them from making large moves in discrete spaces. The path auxiliary proposal [Sun et al., 2021] explores a larger neighborhood by making a sequence of small moves. An adaptive locally-balanced proposal (ALBP) [Sun et al., 2022] has been proposed to determine the update size automatically. However, it still only updates one coordinate per gradient computation and the update has to be done in sequence. On the contrary, our Newton proposal can update many coordinates in parallel.
3 PRELIMINARY
We consider sampling from a target distribution
[TABLE]
where is a -dimensional variable, is a finite variable domain, the energy function is a scalar-valued function, and is the normalizing constant for to be a distribution. In this paper, we restrict to a factorized domain, , , and mainly consider to be or , which correspond to the binary variable and the categorical one, respectively.
As we state in related work, Locally-balanced proposal [Zanella, 2020] does not require a natural continuous distribution, so it is a flexible framework to build efficient and informed proposals for discrete distributions:
[TABLE]
where is a continuous function from to itself satisfying . is a symmetric kernel and is a scale parameter.
When we set , and as the well-known Metropolis-Adjusted Langevin Algorithm (MALA) proposal [Roberts and Rosenthal, 1998], we get an informed proposal as
[TABLE]
where the local difference shows the likelihood ratio between a given input and other discrete states . In case where summing over the full space of is so expensive that the normalizing constant becomes intractable, the locally-balanced proposal often restricts its domain to a small neighborhood. For example, the Gibbs-with-gradient proposal [Grathwohl et al., 2021] only considers local moves inside a Hamming ball with small window sizes.
Finite Difference Finite Difference is a mathematical expression of the form , which is an approximation of derivatives. Specifically, a forward finite difference, denoted , of a function is defined as
[TABLE]
When the window size , can be omitted:
[TABLE]
As for the finite difference with respect to a vector, let’s consider as a -dimensional vector and as a scalar-valued function. The finite difference of with respect to the vector is defined as
[TABLE]
Specifically,
[TABLE]
where changes the -th coordinate from to and keeps the other coordinates the same as .
Newton’s Series Expansion As the discrete analog of the continuous Taylor expansion, Newton’s series expansion is used to approximate discrete functions. In Newton’s series expansion, we use finite differences instead of gradients to indicate neighborhood information.
Let’s consider the scalar version first. The Newton series consists of the terms of the Newton forward difference equation:
[TABLE]
where and represents -th order forward finite difference defined as
[TABLE]
Specifically, the first-order Newton’s series expansion is
[TABLE]
When and are -dimensional vectors, is also a -dimensional vector and the corresponding first-order Newton’s series expansion becomes
[TABLE]
In this paper, we use the first-order Newton’s series expansion under to approximate the likelihood of discrete moves.
4 EFFICIENT INFORMED PROPOSALS VIA NEWTON’S SERIES APPROXIMATION
To sample from discrete distributions efficiently, we propose Newton proposal. We use Newton’s series expansion to approximate the likelihood of making discrete updates, and factorize the discrete domain coordinatewise to reduce the computation cost significantly.
4.1 Informed Proposal via Newton’s Series Approximation
Consider a common -dimensional case . In each iteration, several coordinates are flipped and we update the current samples to . We use a first-order forward Newton’s series expansion with window size to approximate :
[TABLE]
We use the Newton’s series expansion in (6) to approximate the local difference in (3):
[TABLE]
We add a term in the fourth line to get the perfect square form because is independent of and will not affect the normalized result (see the appendix for detailed proof).
In this way, we obtain the new proposal in a perfect square form by Newton’s series expansion:
[TABLE]
where the normalizing constant is summed over :
[TABLE]
In a word, finite difference in Newton’s series approximation serves as a guide when exploring the discrete space, similar to what gradients do in proposals for continuous distributions. It provides neighborhood information about the target distribution so that the sampler can propose new states more informatively rather than “blindly”. Moreover, the perfect square form gives us possibility to accelerate the computation without restrict our domain in a small neighborhood, which we will discuss in detail later.
4.2 Efficient Newton Proposal via Coordinatewise Factorization
The computation cost of the informed proposal in (8) depends on the normalizing constant in (4.1), which needs to sum up all states in the discrete space. Therefore, it is desirable to narrow down the space to make tractable.
Unlike most locally-balanced proposals which restrict the domain to a small neighborhood, , a hamming ball [Zanella, 2020, Grathwohl et al., 2021], a key feature of the proposal (8) is that it is displayed as a Euclidean norm and can be factorized coordinatewise [Zhang et al., 2022a]. To see this, we write (8) as where is a simple categorical distribution:
[TABLE]
Here, stands for categorical distribution, denotes SoftMax function and . Recall that where changes the -th coordinate from to while keeping the other coordinates the same as .
Since both the domain and the proposal over all coordinates in (8) can be factorized coordinatewisely, we can update each coordinate in parallel, thus greatly speeding up the computation. In this way, Newton proposal fills in the blank in bottom right of Table 1. It enables us to sample from high-dimensional complex discrete distributions with better mixing and faster convergence, no matter whether the discrete distribution has an natural differentiable extension.
When modeling the proposal distribution over all coordinates jointly, the overall cost of constructing the proposal in (8) is for . Thanks to the coordinatewise factorization, the cost of Newton proposal is reduced to . This allows the sampler to explore the full space with the neighborhood information without paying a prohibitive cost.
4.3 A Variant: With a MH Correction
It is optional to add a Metropolis-Hastings (MH) step [Metropolis et al., 1953, Zhang et al., 2022a], which is usually combined with proposals to make the Markov chain reversible. Specifically, after generating the next position from a distribution , the MH step accepts it with probability
[TABLE]
By rejecting some of the proposed states, the Markov chain is guaranteed to converge asymptotically to the target distribution. The sampler with our Newton proposal is outlined in Algorithm 1.
We call Newton proposal without the MH step as unadjusted Newton algorithm (UNA) and that with the MH step as Metropolis-adjusted Newton algorithm (MANA). Similar to MALA and ULA in continuous spaces [Grenander and Miller, 1994, Roberts and Stramer, 2002], MANA contains (for finite difference computation) plus (for the MH correction) function evaluations and is guaranteed to converge to the target distribution. Although UNA may have asymptotic bias, it only requires function evaluations, which is valuable especially when performing the function evaluation is expensive such as in large-scale Bayesian inference [Welling and Teh, 2011, Durmus and Moulines, 2019].
5 AN ALTERNATIVE VIEW OF NEWTON PROPOSAL
After getting the Newton proposal via Newton’s series, we give an alternative way to derive Newton proposal via multilinear extension, which gives us more intuition about the efficient informed proposal. From the multilinear extension viewpoint, we find further connection with Discrete Langevin Proposal (DLP) [Zhang et al., 2022a].
5.1 An Equivalent Form via Multilinear Extension
In addition to approximating the likelihood ratio of flipping each dimension with Newton’s series expansion, we find that our Newton proposal can also be obtained by conducting Taylor expansion on the multilinear extension of the discrete distribution. This connection gives us another interesting viewpoint of the Newton proposal. We briefly show the equivalence between these two viewpoints in the binary case here and put the categorical case and detailed proof in the appendix.
Let us consider the -dimensional binary distribution. The coordinates can be denoted as a finite set . The sampling process corresponds to choosing which coordinate to flip, so the discrete distribution is a set function defined over the power set of as . A discrete distribution may not have a natural continuous extension, but its multilinear extension can always be defined as :
[TABLE]
As we can see, is a continuous, differentiable function which accepts real-valued inputs from the interval . This makes it possible to approximate the likelihood of discrete moves with Taylor series expansion. Besides, an inspiring fact is that keeps the same value with when they are evaluated on the discrete subset . For , since is linear in , we have the partial derivative of as:
[TABLE]
We can see that the partial derivative measures the difference between the energy function of the original state and the flipped one, which is exactly the finite difference of the two states. Since and take the same value on the discrete domain, the Newton proposal can be equivalently obtained by conducting Taylor expansion on the multilinear extension of the target discrete distribution.
As for categorical distribution, we need to decide not only which coordinate to flip, but also which level to flip to. Fortunately, the differentiable function can be obtained with a generalized multilinear extension [Sahin et al., 2020], on which the likelihood ratio of flipping each coordinate to any level can be defined. The detailed algorithm is in the appendix.
5.2 Comparison with Discrete Langevin Proposal
Our Newton proposal and the Discrete Langevin Proposal (DLP) [Zhang et al., 2022a] can be seen as two different approximations of the locally-balanced proposal [Zanella, 2020] when taking functions and like MALA, as shown in (3).
DLP is motivated by utilizing gradients to guide the sampling and inference. Thus it requires that the discrete distribution can be displayed as a differentiable function which is only evaluated on a discrete domain, , Ising models, so that the gradient can be defined. In this way, DLP can be viewed as a first-order Taylor series approximation to the local difference term inside in (3) with:
[TABLE]
In contrast, our Newton proposal circumvents this shortcoming by using Newton’s series expansion, a natural tool to approximate discrete functions. Specifically, Newton proposal approximates the local difference in with:
[TABLE]
The finite difference is defined on the grid-like discrete domain and can also guide to explore the discrete space like what gradients do in continuous relaxation-based methods.
In addition to the differences in methods and requirements, our Newton proposal has more general applications than DLP: (1) When the discrete distribution has a natural differential extension, such as Ising model, Restricted Boltzmann Machines (RBMs) and Potts model, DLP and the Newton proposal both work. In some special cases such as some Ising models, when the multilinear extension and the natural continuous extension of the original discrete distribution are the same, DLP and Newton proposal have the same results. (2) The strict requirement about natural differential relaxation limits DLP to be widely used in more complex scenarios. When the target discrete distribution lacks a differentiable extension, DLP does not work while the Newton proposal is still well-suited for these tasks, such as the facility location problem, text summarization, etc.
For the computation cost, DLP contains one gradient computation whose cost depends on and , while the Newton proposal contains function evaluations. Both gradient and function computations can be done in parallel.
6 THEORETICAL ANALYSIS
In this section, we analyze the asymptotic convergence of UNA and the asymptotic efficiency of MANA. Specifically, we first prove in Section 6.1 that when the stepsize , the asymptotic bias of UNA is zero when the discrete distribution is a second-order modular function, whose gain reduction keeps the same for any set [Korula et al., 2018] (see the rigorous definition in the appendix). Later in Section 6.2, we derive the asymptotic efficiency of MANA.
6.1 Asymptotic Convergence of UNA for Second-order Modular Functions
Besides the property of the proposal itself, the effectiveness of a proposal also depends on how close its underlying stationary distribution is to the target distribution because if it is far, even if using the MH step to correct the bias, the acceptance probability will be very low. We consider a second-order modular distribution, which appears in common tasks such as Ising models. The following theorem summarizes UNA’s asymptotic accuracy for such discrete distributions.
Theorem 1. When the discrete distribution is a second-order modular function [Korula et al., 2018], its multilinear extension is quadratic which can be expressed as . The Markov chain following transition in (9) (i.e. UNA) is reversible with respect to some distribution and converges weakly to as . In particular, let be the smallest eigenvalue of , then for any ,
[TABLE]
Theorem 1 shows that the asymptotic bias of UNA decreases at a rate which vanishes to zero as the stepsize . Besides, the asymptotic bias of UNA is related to the smallest eigenvalue of .
Example. Let’s consider the well-known model in thermodynamic systems, the Ising model. Note that its distribution is second-order modular:
[TABLE]
where represents the interaction between any two adjacent sites , and a site has an external magnetic field interacting with it. We run UNA with varying stepsizes on a 2 by 2 Ising model, as shown in Figure 1. For each stepsize, we run the chain long enough to ensure its convergence. The results clearly show that the distance between the stationary distribution of UNA and the target distribution decreases as the stepsize decreases.
6.2 Asymptotic Efficiency of MANA
To understand the asymptotic efficiency of MCMC transition kernels, we study the asymptotic variance and the spectral gap of the kernel. The asymptotic variance is defined as
[TABLE]
where is a scalar-valued function, is a -stationary Markov transition kernel. The asymptotic variances measures the additional variance incurred when using sequential samples from to estimate . The spectral gap is defined as
[TABLE]
where is the second largest eigenvalue of the transition probability matrix of . For transition probability matrices with non-negative eigenvalues, the spectral gap is related to the mixing time, with larger values corresponding to faster mixing [Levin and Peres, 2017].
Since our method approximates in (2), we should expect some decrease in efficiency. We characterize this decrease in terms of the asymptotic variance and the spectral gap, under the Lipschitz-like assumption on . In particular, we show that the decrease is a constant factor that depends on the Lipschitz constant of and the dimension of the target distribution.
Theorem 2 Let and be the Markov transition kernels given by the Metropolis-Hastings algorithm using the locally-balanced proposal and our approximation . Let the finite difference has an Lipschitz-like property with constant , and . Then it holds
. 2. 2.
where and .
Remark. We can see that the constant is correlated with the dimension of the target discrete distribution. In high-dimensional scenarios, will be a large constant, leading to loose bounds of the asymptotic variance and spectral gap. However, on one hand, there is a gap between theory and experiment [Kwisthout and Van Rooij, 2013]. It may be hard to achieve the bound in practice. On the other hand, we can add a slight restriction on the number of changed coordinates in a single step. In this way, can be reduced to a small number as we expect, and we can get tighter bounds in theory.
7 EXPERIMENTS
We conduct a comprehensive empirical evaluation for Newton proposal on synthetic and real-world sampling tasks. The unadjusted and Metropolis-adjusted Newton proposals are denoted as UNA and MANA, respectively. We release the code at https://github.com/DongyaoZhu/Newton-Proposal-for-Discrete-Sampling. Baselines and evaluation tasks are described below.
Baselines. We compare the performance of Newton proposals with widely-used sampling methods for discrete distributions, including (1) two Gibbs-based methods: Gibbs sampling, Gibbs with Gradient (GWG) [Grathwohl et al., 2021]; (2) two methods which perform sampling in a continuous space by gradient-based methods and then transforms the collected samples to the original discrete space: discrete Stein Variational Gradient Descent (D-SVGD) [Han et al., 2020] and relaxed MALA (R-MALA) [Grathwohl et al., 2021]; (3) one gradient-based method which requires the target discrete distribution to have a differential relaxation: Discrete Langevin Proposal (DLP) [Zhang et al., 2022a] and (4) the locally-balanced sampler (LB) [Zanella, 2020]. Specifically, we denote DULA and DMALA for unadjusted and Metropolis-adjusted DLPs, respectively. All methods are implemented in Pytorch and we use the official release of code from previous papers when possible.
Evaluation Tasks. (1) Discrete distributions without natural differentiable extensions. Since GWG and DLP require gradients and can not be applied, we mainly compare the Newton proposal with Gibbs and LB in these tasks. (2) Discrete distributions with natural differentiable extensions. We also apply Newton proposal to these distributions such as the Ising model which is binary and Potts model which is categorical, to show the broad applicability of our method. In this scenario, we also compare Newton proposal with continuous relaxation methods (GWG and DLP) and the results are included in the appendix.
7.1 Facility Location
In the facility location task, we are given a set of facilities denoted and a set of customers to decide whether to open a facility or not [Krause et al., 2008], corresponding to sampling from a binary distribution. If the -th facility is opened, then it provides service of value to customer . We suppose that each customer chooses the opened facility with highest value, then the total value provided to all customers is . Besides, we penalize the number of selected facilities to ensure the most total utility with a small number of facilities. Therefore, the distribution of the facility location model can be represented as where is a hyperparameter, controlling the strength of the penalty term. To evaluate the Newton proposal on facility location task, we generate the utility matrix from a gaussian mixture model with and . We run 30000 iterations and set the stepsizes of MANA and UNA as 1 and 0.2, respectively.
We first compare the root-mean-square error (RMSE) between the estimated mean and the true mean under in Figure 2. The blue lines of MANA are both below other lines, indicating that MANA is the fastest to converge in terms of both iterations and running time. This demonstrates (1) sampling in the original discrete space is important: D-SVGD and R-MALA get poor results because this task is complex and the relaxed distributions are hard to sample from; (2) the finite difference makes the exploration over the discrete space more informative rather than “blind” compared to Gibbs; (3) the MH step enables MANA to make larger and more effective moves with a larger step size without worrying about unconvergence compared to UNA; (4) changing many coordinates in one step accelerates the convergence compared to LB and Gibbs. We then show that Newton proposal can change multiple coordinates in one iteration while still maintaining a high acceptance rate in Figure 3(a). When the stepsize , on average MANA can change 5.3 coordinates in one iteration with an acceptance rate in the MH step, while the acceptance rate of LB proposal is only . In Figure 3(b), we compare the effective sample size (ESS) for exact samplers (, having the target distribution as its stationary distribution). MANA outperforms other methods, indicating the correlation among its samples is low due to making significant updates in each step.
7.2 Extractive Text Summarization
Extractive text summaries are formed by selecting several sentences from source documents that best fit certain quality measurements. [Lin and Bilmes, 2011] designed their metrics on for both similarity and diversity, formally defined as subject to some cost constraint , where measures the coverage or "fidelity" of summary set to the document, rewards diversity in , and is a trade-off coefficient. The undifferentiable distribution of the summary makes the task well-suited for our methods based on pseudo-gradients. Futhermore, due to the limited time constraint, the proposals with non-parallel updates cannot explore enough into the distribution, while our Newton proposal will be able to avoid this issue with multidimensional updates. We use an exponential decay schedule on step size to encourage faster convergence under limited time constraint. The final result is then given by a sample-wise majority vote algorithm [Wang et al., 2022] on the collection of samples we produce.
We report results of MANA, LB and Gibbs sampler on DUC 2002 dataset [Over and Liggett, 2002], which contains about 30 sentences per document. ROUGE scores [Lin, 2004] are widely used for text summarization evaluation, and we compare ROUGE-2 scores (precision , recall , and F-measure ) of different samplers. In addition, we show the average objective scores at 500th, 750th and 1000th iteration, respectively. As shown in Table 2, Newton proposal constantly outputs highest and ROUGE-2 scores under limited time constraints.
7.3 Image Retrieval
Given a database of images and a query image, we look for a subset from the database that best matches the query. We follow the same settings in the extractive text summarization experiments, and we use the discontinuous score function proposed by [Yang et al., 2014] to measure the matching of a particular collection of images to a query image. We empirically found that sample-wise majority vote algorithm [Wang et al., 2022] did not perform well, thus we also propose a dimension-wise majority vote algorithm: given a collection of samples , each dimension will have a count of , and the dimensions of top counts are selected (details in appendix).
We evaluate various methods on the INRIA Holidays Dataset [Jegou et al., 2008] which consists of 1491 images (dimensions) and 500 queries. Our results in values and mean Average Precision (mAP) are reported in Table 3. The shows that the sampler with Newton proposal quickly reaches and stably keeps a better solution to the maximization problem than other methods despite limited time constraint. Our high mean Average Precision demonstrates that our solution is a better approximation to the ground truth labels compared to other methods.
8 CONCLUSION
We propose a new gradient-like efficient informed proposal, the Newton proposal, for general discrete distributions. This proposal better explores discrete spaces under the guidance of the finite difference produced by Newton’s series expansion, which does not require natural differentiable expansions. Additionally, the factorization on coordinates allows multiple coordinates to be updated simultaneously, leading to a faster convergence rate. To the best of our knowledge, Newton proposal makes the first attempt to utilize Newton’s series expansion and multilinear extension in discrete sampling, which fills the gap of efficient sampling for complex discrete distributions when gradients are not available. For different application scenarios, we develop several variants with Newton proposal, including unadjusted and Metropolis-adjusted versions. We theoretically prove the convergence and efficiency of Newton proposal without and with the MH step. Empirical results on various problems demonstrate the superiority of our method over baselines in general settings.
Acknowledgments
We would like to thank Yingzhen Li for helpful discussions and the anonymous reviewers for their thoughtful comments on the manuscript.
Appendix A Detailed Derivation of Newton Proposal
We give more details about the derivation of our Newton proposal.
We start from the MALA-like locally-balanced proposal [Zanella, 2020] we mentioned in Section 3:
[TABLE]
where is the normalizing constant. When we use Newton’s series expansion to approximate the local difference , we get
[TABLE]
where the second line is because is independent of and will not affect the normalized result.
Since (13) is actually an -2 norm, can be factorized coordinatewisely. Besides, we assume the domain can be factorized coordinatewise. Therefore, we can factorize in (7) as and
[TABLE]
Then we get our Newton proposal which is easy to compute in parallel:
[TABLE]
Appendix B Algorithm for Binary Variables
When the variable domain is binary , if we flip any coordinate to , is always 1. Thanks to the coordinatewise factorization, the sample space only contains 2 states: flipping or remaining the original state, which makes the normalizing constant tractable. In this way, we could simplify Algorithm 1 in the main body of our paper further and obtain the following algorithm, which clearly shows that our method can be cheaply computed in parallel on CPUs and GPUs. We give the pseudo code when sampling from binary distributions with Newton proposal as follows.
Appendix C Newton Proposal for Categorical Variables
C.1 Method 1: Newton’s Series Approximation
When using one-hot vectors to represent categorical variables, our Newton proposal becomes
[TABLE]
where are one-hot vectors.
If the variables are ordinal with clear ordering information, we can also use integer representation and compute the Newton proposal as in Equation (15).
C.2 Method 2: Multilinear Extension
As we stated in Section 5.1, for binary variables, our Newton proposal obtained from the first-order Newton’s series approximation is equivalent to that obtained from the first-order Taylor series approximation to the multilinear extension of the original discrete target distribution. For categorical variables, we not only need to decide which coordinate to flip, but also need to decide the level. By introducing the concepts of ’multiset’, we generalize the multilinear extension to categorical distributions and thus extend our Newton proposal to categorical variables.
C.2.1 Multiset
In classical sets, distinct elements can only occur once. A multiset is a natural generalization of a set, where elements can be contained repeatedly. The number of times an element occurs is called the multiplicity of the element . A multiset is defined as a pair , where is the support and is a function defining multiplicity for each element [Sahin et al., 2020]. Given this definition, we can use the integer vector of the multiset’s multiplicity to represent any multiset. Also, we can transfer several important notions from multisets to integer vectors, such as the notion of a subset, set intersection, set union and set difference.
Now consider sampling from a -dimensional discrete distribution. The support can be represented as , where and the discrete function is an integer function defined as : . For ease of notation, we assume that does not depend on and is the same along each dimension, , , . The obtained sample will be an integer vector (, ) and can be equivalently represented as a multiset. Note that the integer () can be also represented as a binary vector of length . Take the case as an example. can be , or , corresponding to the level of 0, 1, 2, respectively. For simplicity in the calculation, we will use the binary representation in the following parts.
C.2.2 Generalized Multilinear Extension
Given the discrete distribution , the generalized multilinear extension will extend to a continuous domain while keeping the values on the original discrete domain.
Let be the marginals of a -dimensional categorical distribution and is the concatenation of all vectors. Each lives in the dimensional simplex . The simplex is defined as
[TABLE]
We define the union of simplexes as , and naturally . Once we sample from , we get . The generalized multilinear extension is defined on the space of the product of categorical distributions and can be written as:
[TABLE]
We need to compute the sum of elements to compute the expectation in Equation (17). Note that when , this extension corresponds to the multilinear extension of a set function. Here is an example with and . In this case, we have two categorical distributions which take three different values.
[TABLE]
where we have the following constraints
[TABLE]
C.2.3 Newton Proposal via Generalized Multilinear Extension
Similar to the multilinear extension for the binary domain, we can calculate the first-order partial derivative of the generalized multilinear extension of , , . Given , let be a random multiset where elements appear independently with probabilities . Since is multilinear, the partial derivative can be written as a difference of two generalized multilinear extensions:
[TABLE]
where and corresponds to union and set difference between multisets, respectively. is a unit vector with the -th element 1. is the multiset whose -th element has multiplicity . In this way, we can get and calculate the proposal of each coordinate as below:
[TABLE]
C.3 Experiments
We implement the Newton proposal on the dim Potts model.
Figure 4 shows that MANA converges fast in both the number of iterations and running time. Our Newton proposal achieves promising performance for categorical variables.
Appendix D Details and Proof of Theorem 1
As we discuss in Section 5, our Newton proposal can be equivalently obtained by conducting Taylor expansion on the multilinear extension of the target discrete distribution. In this section, we focus on second-order modular functions and study the asymptotic convergence of UNA.
D.1 Definitions of Submodular Function and Multilinear Extension
A submodular function is a set function whose value has the property that the difference in the incremental value of the function that a single element makes when added to an input set decreases as the size of the input set increases. In a word, submodular functions have a natural diminishing returns property. If is submodular then its multilinear extension, , , is concave along any line .
For a discrete distribution whose function is a set function , its multilinear extension is defined as:
[TABLE]
It is possible to relate properties of to properties of its multilinear extension . In particular, we have:
Proposition 1. Let be the multilinear extension of , then:
If is non-decreasing, then is non-decreasing along any direction . 2. 2.
If is submodular then is concave along any line .
Both properties can be established by first looking at how behaves along coordinates axes. We first calculate the first and second order derivative of .
Let , since is linear in , we have:
[TABLE]
Let be the random subset of where each element is included with probability , then we can rewrite:
[TABLE] 2. 2.
Similarly, let be the random subset of and be the random subset of where each element is included with probability , we have:
[TABLE]
By submodularity of , the last quantity in (21) is non-positive, , .
Let and . We define the function of the real variable . We note that and .
If is non-decreasing, then and . Hence is nondecreasing. 2. 2.
If is submodular, then and . Hence is concave.
D.2 Second-Order Modular
For a submodular function , let denote the marginal gain from adding element to set . For sets , we define as the amount by which reduces the marginal gain from adding e to . (Here, GR stands for Gain Reduction.) Note that by definition of submodularity, is always non-negative.
The function is said to be second-order modular if, for all sets such that , and , and all elements , we have: [Korula et al., 2018].
Specifically, if for any set , , and where and are both constants, then . At this time, is second-order modular. That is to say, the second-order modular function is in the form of , whose multilinear function is
[TABLE]
which is exactly in the same form with the energy function of Ising model.
D.3 Proof of Theorem 1
Proof.
We finish the proof in the view of multiliear extension, , we see the multilinear extension of the original discrete distribution as the energy function. We first prove the weak convergence and then prove the convergence rate with respect to the stepsize .
(1) Weak convergence. When is second-order modular, the Hessian matrix of its multilinear extension in Equation (22) will be a constant, , . We have that . Since is a constant, we can rewrite the proposal distribution as the following
[TABLE]
where the last equation is because the Taylor expansion .
Let , and , now we will show that is reversible w.r.t. . We have that
[TABLE]
We can see that the expression in (23) is symmetric in and . Therefore is reversible and the stationary distribution is . Now we will prove that converges weakly to as . Notice that for any ,
[TABLE]
where is a Dirac delta. It follows that converges pointwisely to . By Scheffé’s Lemma, we attain that converges weakly to .
(2) Convergence Rate w.r.t. Stepsize.
Let us consider the convergence rate in terms of -norm
[TABLE]
We write out each absolute value term
[TABLE]
Since , it follows that
[TABLE]
We also notice that , thus when , we get
[TABLE]
Similarly, when , we have,
[TABLE]
Therefore, the difference between and can be bounded as follows
[TABLE]
∎
Appendix E Proof of Theorem 2
Proof.
Our proof follows from Theorem 1 of [Grathwohl et al., 2021] and Theorem 2 of [Zanella, 2020], which state that for two -reversible Markov transition kernels and , if there exists for all such that then
2. 2.
where is the asymptotic variance and is the spectral gap, which are both defined in the main body of this paper. is the standard variance . Our proof proceeds by showing we can bound , and the results of the theorem then follow directly from Theorem 2 of [Zanella, 2020].
E.1 Definitions
We begin by writing down the proposal distribution of interest and their corresponding Markov transition kernels. For ease of notion we define some values
[TABLE]
Then our original proposal, , the MALA-like locally-balanced proposal for is
[TABLE]
where we have defined
[TABLE]
When we examine the acceptance rate of the proposal we find
[TABLE]
Then the acceptance rate of the target proposal in (24) can be simplified as
[TABLE]
This corresponding Markov transition kernel is
[TABLE]
Our proposed Newton proposal is the first-order Newton’s series approximation of the original target proposal (24) for :
[TABLE]
where we have defined
[TABLE]
For our Newton proposal, we simplify the term in the acceptance rate of the proposal as
[TABLE]
Then the Markov transition kernel
[TABLE]
E.2 Preliminaries
It can be seen that is a first-order Newton’s series approximation to . When the finite difference has an analog of Lipschitz continuity, , , since is bounded, we have
[TABLE]
E.3 Normalizing Constant Bounds
We derive upper- and lower-bounds on in terms of .
[TABLE]
Following the same argument we can show
[TABLE]
In this way, we get bounds between the normalizing constants of original proposal and our approximated proposal.
E.4 Inequalities of Minimums
We show for . Since both and , it is sufficient to show and to prove the desired result. We begin with the terms
[TABLE]
Now the terms
[TABLE]
E.5 Conclusions
We have shown that and and therefore it holds that
[TABLE]
From this, the main result follows directly from Theorem 2 of [Zanella, 2020]. ∎
Appendix F Experiments on Ising Model
In the main body of the paper, we have shown that for distributions without natural differentiable extension, our Newton proposal outperforms popular baselines, including Gibbs sampler and locally-balanced sampler (LB) [Zanella, 2020], while continuous relaxation-based proposals become valid. We also apply Newton proposal to discrete distributions with natural differentiable distributions, such as the Ising model, to show the broad applicability of our method. In this case, we compare our Newton proposal with proposals which do not rely on gradients (Gibbs sampler and LB) and continuous relaxation methods (GWG[Grathwohl et al., 2021] and DLP [Zhang et al., 2022a]).
F.1 Experiment Settings
We have shown in Section 5 that Newton proposal is equivalent to DLP when the discrete distribution has a natural differential extension. Therefore, if Newton proposal and DLP are applied to sample from the same discrete distribution, they will have the same results. However, we also demonstrate in Theorem 1 that the smallest eigenvalue of is related to the asymptotic convergence. Consider the Ising model whose distribution is
[TABLE]
where is a binary adjacency matrix, is the connectivity strength and is the bias. We can see the diagonal of matrix in Equation (25) will not affect the distribution in the discrete domain because and are different elements in the set . Meanwhile, DLP is applied to a discrete distribution in a 0-diagonal quadratic form, which can be seen as the multilinear extension of the distribution of Newton proposal when is 0-diagonal. We are interested in the performance of Newton proposal and DLP when in Newton proposal is larger than that in DLP.
F.2 Ising Model Sampling Results
We consider a 4 by 4 lattice Ising model with random variable , and . The distribution is
[TABLE]
where is a binary adjacency matrix, is the connectivity strength and is the bias.
We run 60000 iterations with all samplers. To make the comparison of convergence speed fair, we tune the stepsizes so that MANA and DMALA change almost the same number of coordinates in a single update. The stepsizes of MANA, DMALA, UNA and DULA as 0.5, 0.8, 0.1 and 0.1, respectively.
We first compare the root-mean-square error (RMSE) between the estimated mean and the true mean in Figure 5. MANA is the fastest to converge in terms of both iterations and runtime. This demonstrates (1) changing many coordinates in one step accelerates the convergence compared to LB and GWG; (2) the finite difference works like the gradient in cases with natural differential extensions to explore the discrete space. In fact, the Newton proposal is equivalent to DLP when the discrete distribution has a natural differential extension, as shown in Section 5. That’s why DMALA and MANA achieve similar convergence when they change the same number of coordinates in a single update. MANA and DMALA converge obviously faster than UNA and DULA because the MH correction accelerates the convergence for this task. In Figure 5(c), we compare the effective sample size (ESS) of different samplers. MANA and DMALA significantly outperform other methods, indicating the correlation among its samples is low due to making significant updates in each step.
Appendix G ROUGE Score in Text Summarization
We used the official implementation of Rouge score evaluation toolkit (https://pypi.org/project/rouge-score/). We follow the same ROUGE score configuration as the settings in [Lin, 2004]: ROUGE version 1.5.5 with options: -a -c 95 -b 665 -m -n 4 -w 1.2.
Appendix H Dimension-wise Majority Vote in Image Retrieval
For the image retrieval task, the pseudo code for the dimension-wise majority vote algorithm mentioned in the main body of our paper is as follows:
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Wang and Cho [2019] Alex Wang and Kyunghyun Cho. Bert has a mouth, and it must speak: Bert as a markov random field language model. ar Xiv preprint ar Xiv:1902.04094 , 2019.
- 2Gu et al. [2017] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. ar Xiv preprint ar Xiv:1711.02281 , 2017.
- 3Wang et al. [2010] Jianrong Wang, Ahsan Huda, Victoria V Lunyak, and I King Jordan. A gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags. Bioinformatics , 26(20):2501–2508, 2010.
- 4Courbariaux et al. [2016] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. ar Xiv preprint ar Xiv:1602.02830 , 2016.
- 5Peters and Welling [2018] Jorn WT Peters and Max Welling. Probabilistic binary neural networks. ar Xiv preprint ar Xiv:1809.03368 , 2018.
- 6Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning , pages 1263–1272. PMLR, 2017.
- 7Zhang et al. [2022 a] Ruqi Zhang, Xingchao Liu, and Qiang Liu. A langevin-like sampler for discrete distributions. In International Conference on Machine Learning , pages 26375–26396. PMLR, 2022 a.
- 8Roberts and Tweedie [1996] Gareth O Roberts and Richard L Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli , pages 341–363, 1996.
