Trace Ratio Optimization with Application to Multi-View Learning

Theoretical and computational analysis of trace ratio optimization on the Stiefel manifold, with applications to Fisher's LDA, canonical correlation analysis, and multi-view subspace learning.

1. Introduction

This paper investigates the trace ratio optimization problem over the Stiefel manifold from both theoretical and computational perspectives. The fundamental problem is to maximize the trace ratio function f_α(X) = trace(X^T A X + X^T D) / [trace(X^T B X)]^α, where X belongs to the Stiefel manifold O^{n×k} = {X ∈ R^{n×k} : X^T X = I_k}. Here A and B are symmetric n×n matrices with B positive semi-definite and rank(B) > n-k, D is an n×k matrix, and the parameter α ranges between 0 and 1. The condition rank(B) > n-k ensures the denominator remains positive for all feasible X.

The Stiefel manifold optimization framework provides a rigorous mathematical foundation for solving this class of problems, which has significant implications across multiple domains of data science and machine learning. The research establishes necessary conditions in the form of nonlinear eigenvalue problems with eigenvector dependency and develops convergent numerical algorithms based on the self-consistent field (SCF) iteration.

1.1 Previous Work

The paper identifies and analyzes three significant special cases that have been extensively studied in previous literature:

Fisher's Linear Discriminant Analysis

With D = 0 and α = 1, the problem reduces to max_{X ∈ O^{n×k}} trace(X^T A X) / trace(X^T B X), which arises in Fisher's linear discriminant analysis for supervised learning. Previous approaches converted this into a zero-finding problem: solve φ(λ) = 0, where φ(λ) := max_{X ∈ O^{n×k}} trace(X^T (A - λB) X). The function φ(λ) is proven to be non-increasing and typically has a unique zero, which can be found using Newton's method. The Karush-Kuhn-Tucker (KKT) conditions lead to a nonlinear eigenvalue problem with eigenvector dependency (NEPv): H(X) X = X Λ, where H(X) is a symmetric matrix-valued function of X and Λ = X^T H(X) X.
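To make the zero-finding view concrete, here is a minimal sketch of Newton's method on φ(λ). It uses two standard facts: the inner maximizer is spanned by eigenvectors of A - λB for its k largest eigenvalues, and φ'(λ) = -trace(X^T B X) at that maximizer. Function names and tolerances are illustrative, not the paper's implementation:

```python
import numpy as np

def phi(lam, A, B, k):
    # phi(lam) = max over the Stiefel manifold of trace(X^T (A - lam*B) X)
    #          = sum of the k largest eigenvalues of A - lam*B
    w, V = np.linalg.eigh(A - lam * B)
    idx = np.argsort(w)[::-1][:k]          # indices of the k largest eigenvalues
    return w[idx].sum(), V[:, idx]

def trace_ratio_newton(A, B, k, lam=0.0, tol=1e-12, maxit=100):
    # Newton's method on phi(lam) = 0; since phi'(lam) = -trace(X^T B X)
    # at the maximizer X, each step is lam <- trace(X^T A X) / trace(X^T B X).
    for _ in range(maxit):
        val, X = phi(lam, A, B, k)
        step = val / np.trace(X.T @ B @ X)  # equals -phi(lam)/phi'(lam)
        if abs(step) < tol:
            return lam, X
        lam += step
    return lam, X
```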

Orthogonal Canonical Correlation Analysis

With A = 0 and α = 1/2, the problem becomes max_{X ∈ O^{n×k}} trace(X^T D) / [trace(X^T B X)]^{1/2}, which emerges in orthogonal canonical correlation analysis (OCCA), where it serves as the kernel of an alternating iterative scheme. The KKT conditions for this case do not immediately take the NEPv form, but they can be equivalently transformed into one, enabling solution via SCF iteration with appropriate post-processing.

Unbalanced Procrustes Problem

The third special case connects to the unbalanced Procrustes problem: with α = 0 the denominator drops out and the objective reduces to the quadratic-plus-linear form trace(X^T A X + X^T D), which is the form arising there. All three cases demonstrate the broad applicability of the trace ratio optimization framework across diverse statistical learning paradigms.

2. Problem Formulation

The general trace ratio optimization problem is formally defined as:

Problem (1.1a): max_{X^T X = I_k} f_α(X)

where f_α(X) = trace(X^T A X + X^T D) / [trace(X^T B X)]^α

The parameters satisfy:

  • 1 ≤ k < n, with I_k the k×k identity matrix
  • A, B ∈ R^{n×n} are symmetric, B is positive semi-definite, and rank(B) > n-k
  • D ∈ R^{n×k}, and the matrix variable X ∈ R^{n×k}
  • 0 ≤ α ≤ 1
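For reference, the objective and the feasibility constraint are straightforward to evaluate numerically. The following is a minimal sketch; the names f_alpha and random_stiefel are ours, not the paper's:

```python
import numpy as np

def f_alpha(X, A, B, D, alpha):
    # f_alpha(X) = trace(X^T A X + X^T D) / trace(X^T B X)^alpha
    num = np.trace(X.T @ A @ X + X.T @ D)
    return num / np.trace(X.T @ B @ X) ** alpha

def random_stiefel(n, k, seed=0):
    # A random feasible point: the Q factor of a Gaussian n-by-k matrix
    # satisfies Q^T Q = I_k, i.e. Q lies on O^{n x k}.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return Q
```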

The paper also notes that a seemingly more general case with an additional constant c in the numerator can be reformulated as a special case of Problem (1.1) through algebraic manipulation; for instance, since trace(X^T X) = k on the Stiefel manifold, the constant can be absorbed by replacing A with A + (c/k) I_n, so the framework loses no generality.
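Whether or not this is the paper's exact manipulation, the absorption identity itself is easy to verify numerically; a quick check under the definitions above:

```python
import numpy as np

# On the Stiefel manifold trace(X^T X) = k, so a constant c in the numerator
# is absorbed by the shift A -> A + (c/k) I_n.
rng = np.random.default_rng(0)
n, k, c = 30, 4, 2.5
X, _ = np.linalg.qr(rng.standard_normal((n, k)))     # feasible point
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric test matrix
shifted = np.trace(X.T @ (A + (c / k) * np.eye(n)) @ X)
assert np.isclose(shifted, np.trace(X.T @ A @ X) + c)
```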

3. Theoretical Foundations

The research establishes several fundamental theoretical results:

Necessary Conditions

The necessary optimality conditions for the trace ratio optimization problem are derived as nonlinear eigenvalue problems with eigenvector dependency (NEPv). For the special case of Fisher's LDA (α = 1, D = 0), the NEPv takes the form H(X) X = X Λ, where H(X) = A - λ(X) B and λ(X) = trace(X^T A X) / trace(X^T B X).
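First-order optimality in this case can be checked numerically by forming the NEPv residual. A minimal sketch under the definitions above (the function name is ours):

```python
import numpy as np

def nepv_residual(X, A, B):
    # H(X) = A - lambda(X) * B, with lambda(X) = trace(X^T A X) / trace(X^T B X)
    lam = np.trace(X.T @ A @ X) / np.trace(X.T @ B @ X)
    H = A - lam * B
    Lam = X.T @ H @ X                      # Lambda = X^T H(X) X
    return np.linalg.norm(H @ X - X @ Lam) # zero at a KKT point
```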

Existence and Uniqueness

For Problem (1.3) (the Fisher LDA case), it is proven that every local maximizer is in fact global. This important property ensures that any algorithm converging to a local maximizer reaches a globally optimal solution.

Geometric Interpretation

The optimization occurs over the Stiefel manifold, which has rich geometric structure. The convergence of algorithms is analyzed in terms of the Grassmann manifold G_k(R^n), the collection of all k-dimensional subspaces of R^n, providing a geometric perspective on the optimization process.
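Since convergence is measured between subspaces rather than between basis matrices, a distance on G_k(R^n) is the natural metric. One common choice, sketched below via principal angles (not necessarily the paper's exact metric):

```python
import numpy as np

def grassmann_distance(X1, X2):
    # Principal angles between span(X1) and span(X2): for orthonormal X1, X2,
    # the singular values of X1^T X2 are cos(theta_i).
    s = np.linalg.svd(X1.T @ X2, compute_uv=False)
    theta = np.arccos(np.clip(s, 0.0, 1.0))
    return np.linalg.norm(theta)           # geodesic-type distance on G_k(R^n)
```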

4. Numerical Methods

The paper proposes and analyzes the self-consistent field (SCF) iteration for solving the trace ratio optimization problem:

SCF Algorithm

The basic SCF iteration for Problem (1.3) is H(X_{i-1}) X_i = X_i Λ_{i-1}, starting from an initial X_0. Here X_i is an orthonormal basis matrix associated with the k largest eigenvalues of H(X_{i-1}). The method has roots in electronic structure calculations, where it has been applied successfully for decades.
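A minimal sketch of this SCF loop for the LDA case (α = 1, D = 0), assuming a dense symmetric eigensolver suffices at the problem sizes of interest; the stopping rule and names are illustrative:

```python
import numpy as np

def scf_lda(A, B, k, X0, tol=1e-10, maxit=500):
    # SCF iteration: X_i spans the dominant k-dimensional eigenspace
    # of H(X_{i-1}) = A - lambda(X_{i-1}) * B.
    X = X0
    for _ in range(maxit):
        lam = np.trace(X.T @ A @ X) / np.trace(X.T @ B @ X)
        w, V = np.linalg.eigh(A - lam * B)
        X_new = V[:, np.argsort(w)[::-1][:k]]        # k largest eigenvalues
        # Stop when the *subspace* stops moving (difference of projectors),
        # since the basis itself is only determined up to rotation.
        if np.linalg.norm(X_new @ X_new.T - X @ X.T) < tol:
            return X_new
        X = X_new
    return X
```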

Convergence Properties

The research establishes strong convergence guarantees for the SCF iteration:

  • Monotonic Convergence: The objective value increases monotonically with each iteration (see the empirical check sketched after this list)
  • Global Convergence: The iterates X_i converge globally to a maximizer in the metric of the Grassmann manifold
  • Quadratic Convergence: In generic cases the convergence is locally quadratic, ensuring fast convergence near the optimum
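The monotonicity claim is easy to observe empirically. This self-contained sketch (random data, illustrative sizes) tracks the ratio λ(X_i) across SCF steps and checks that it never decreases:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2       # symmetric
M = rng.standard_normal((n, n)); B = M @ M.T + np.eye(n) # positive definite

X, _ = np.linalg.qr(rng.standard_normal((n, k)))         # feasible start
vals = []
for _ in range(20):
    lam = np.trace(X.T @ A @ X) / np.trace(X.T @ B @ X)
    vals.append(lam)
    w, V = np.linalg.eigh(A - lam * B)
    X = V[:, np.argsort(w)[::-1][:k]]

# Each step maximizes trace(X^T (A - lam*B) X) >= 0, forcing lam upward.
assert all(b >= a - 1e-9 for a, b in zip(vals, vals[1:]))
```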

Implementation Considerations

For the OCCA case (α = 1/2, A = 0), the SCF iteration requires a post-processing step on X_i to handle the different structure of the optimality conditions. The method remains monotonically convergent in the objective value, with iterates converging to a critical point that satisfies the necessary conditions for global optimality.

5. Multi-View Learning Application

As a significant application, the paper proposes a new framework for multi-view subspace learning:

Framework Design

The trace ratio optimization provides a mathematical foundation for developing novel multi-view learning models that can effectively integrate information from multiple data representations or modalities.

Concrete Models

The framework is instantiated into specific computational models that leverage the trace ratio formulation to learn subspaces that capture complementary information across different views of the same data.

Theoretical Advantages

The approach offers theoretical advantages over existing multi-view learning methods by providing optimality guarantees through the well-established trace ratio optimization framework.

6. Experimental Results

Algorithm Efficiency

The proposed numerical methods demonstrate computational efficiency in practical applications.

Model Effectiveness

The new multi-view subspace learning models show effectiveness on real-world datasets.

Convergence Performance

The SCF iteration exhibits reliable convergence across diverse problem instances.

Experimental validation on real-world datasets confirms both the computational efficiency of the proposed numerical methods and the effectiveness of the new multi-view subspace learning models. The SCF iteration demonstrates robust performance across various problem instances, while the multi-view learning framework shows improved capability in integrating information from multiple data views compared to existing approaches.

7. Key Insights

  • Unified Framework: The trace ratio optimization provides a unified mathematical framework encompassing several important data science problems including Fisher's LDA, canonical correlation analysis, and the unbalanced Procrustes problem
  • Theoretical Guarantees: Strong theoretical guarantees including global convergence and absence of local optima in certain cases distinguish this approach from many other manifold optimization methods
  • Practical Algorithm: The SCF iteration offers a practical, efficient algorithm with proven convergence properties for solving these challenging optimization problems
  • Broad Applicability: The framework's applicability to multi-view learning demonstrates its relevance to contemporary machine learning challenges involving multiple data representations
  • Geometric Optimization: The Stiefel manifold formulation provides a geometrically natural framework for subspace learning problems

8. Conclusion

This research makes significant contributions to both the theory and computation of trace ratio optimization over the Stiefel manifold. The establishment of necessary conditions in the form of NEPv, the development and convergence analysis of the SCF iteration, and the application to multi-view subspace learning collectively represent substantial advances in the field. The unified framework connects several previously disparate problems in data science, while the theoretical guarantees and efficient algorithms provide practical tools for solving these important optimization problems. The successful application to multi-view learning demonstrates the real-world impact of this theoretical work, opening avenues for further research in structured optimization for machine learning.

The trace ratio optimization framework continues to offer promising directions for future research, including extensions to more complex manifold structures, applications to emerging learning paradigms, and development of even more efficient numerical methods leveraging recent advances in optimization and numerical linear algebra.