DeCLaRe Lab was founded in 2019, so this archive focuses on lab publications from 2019 onward. Use keyword search, citation status, PDF availability, or category filters. Links are visible by default; abstracts expand inline when available.

165 papers since 2019

2026

10 Open Challenges Steering the Future of Vision-Language-Action Models

S Poria, N Majumder, CY Hung, AA Bagherzadeh, C Li, K Kwok, Z Wang

AAAI 2026
Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena. In this paper, we discuss 10 principal challenges in the ongoing development of VLA models—multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. We also discuss emerging trends such as spatial understanding, modeling world dynamics, post-training, and data synthesis, aiming to accelerate the development of VLA models.

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Z Wu, Q Zha, T Wang, G Xu, W Gu, W Rao, N Ma, B Cheng, S Poria

ICML 2026
Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional, multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExT-QA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate the consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

Data Agent: Learning to Select Data via End-to-End Dynamic Optimization

S Yang, F Su, H Gan, Z Ye, J Li, B Xu, F Shen, S Poria

ICML 2026
Dynamic data selection aims to accelerate training by prioritizing informative samples during online training. However, existing methods typically rely on task-specific handcrafted metrics or static, snapshot-based criteria to estimate sample importance, limiting scalability across learning paradigms and making it difficult to capture the evolving utility of data throughout training. To address this challenge, we propose Data Agent, an end-to-end dynamic data selection framework that formulates data selection as a training-aware sequential decision-making problem. The agent learns a sample-wise selection policy that co-evolves with model optimization, guided by a composite reward that integrates loss-based difficulty and confidence-based uncertainty signals. These reward signals capture the complementary objectives of optimization impact and information gain, and a tuning-free adaptive weighting mechanism balances them over training. Extensive experiments across a wide range of datasets and architectures demonstrate that Data Agent consistently accelerates training while preserving or improving performance, e.g., reducing costs by over 50% on ImageNet-1k and MMLU with lossless performance. Moreover, its dataset-agnostic formulation and modular reward make it plug-and-play across tasks and scenarios, e.g., remaining robust to noisy datasets, highlighting its potential in real-world settings.
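The composite reward described in the abstract can be sketched as follows; the exact form, the linear schedule, and all names (`composite_reward`, `select`, the `loss`/`conf` fields) are illustrative assumptions, not the paper's implementation.

```python
def composite_reward(loss, confidence, step, total_steps):
    """Illustrative composite reward: loss-based difficulty plus
    confidence-based uncertainty, with an adaptive weight (assumed
    linear here) shifting emphasis over the course of training."""
    difficulty = loss                   # harder samples incur larger loss
    uncertainty = 1.0 - confidence      # less confident -> more informative
    alpha = 1.0 - step / total_steps    # tuning-free schedule (assumed form)
    return alpha * difficulty + (1.0 - alpha) * uncertainty

def select(samples, k, step, total_steps):
    """Keep the top-k samples by reward for the next training step."""
    scored = sorted(samples,
                    key=lambda s: composite_reward(s["loss"], s["conf"],
                                                   step, total_steps),
                    reverse=True)
    return scored[:k]
```

Early in training this schedule favors high-loss samples; later it favors uncertain ones, loosely mirroring the paper's shift from optimization impact to information gain.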

Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

M Song, R Liu, X Wang, Y Jiang, P Xie, F Huang, J Zhou, D Herremans, S Poria

ICLR 2026
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems

R Zhou, M Song, X Wu, S Cheng, X Yin, Y Xie, Z Hao, W Hua, L Pan, S Poria

arXiv preprint arXiv:2601.21742 2026
Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and an inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly built peer profiles from history. We further optimize ECL via reinforcement learning with auxiliary rewards. Our experiments reveal that ECL enables small models like Qwen 3-4B to outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL generalizes well to various MA configurations, and we find that LLMs model trust well, revealing a strong correlation between trust-modeling accuracy and final-answer quality.
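A minimal sketch of history-aware reference as described above: estimate each peer's reliability from its past record and defer only when the agent itself is uncertain. The accuracy-as-reliability estimate, the confidence threshold, and the `history`/`answer` fields are assumptions for illustration, not ECL's actual peer-profile format.

```python
def peer_reliability(history):
    """Estimate a peer's reliability as its historical accuracy.
    `history` is an assumed list of booleans (past answers correct?)."""
    if not history:
        return 0.5  # no evidence: treat the peer as a coin flip
    return sum(history) / len(history)

def aggregate(own_answer, own_confidence, peers, threshold=0.7):
    """Keep the agent's own answer when confident; otherwise defer to
    the most historically reliable peer."""
    if own_confidence >= threshold:
        return own_answer
    best = max(peers, key=lambda p: peer_reliability(p["history"]))
    return best["answer"]
```

The point of the sketch is the shift the abstract describes: the agent never has to judge the quality of a peer's reasoning, only its track record.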

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Y Wu, M Song, Y Lan, L Wang, Z Hu, Y Xiao, H Zhou, W Zheng, D Raharja, S Poria

arXiv preprint arXiv:2602.21015 2026
Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

LLM Alignment should go beyond Harmlessness–Helpfulness and incorporate Human Agency

U Naseem, T Chakraborty, KW Chang, M Dras, P Nakov, N Peng, S Poria

Cognitive Computation 2026
Large Language Models are transforming communication, research, and decision-making, but misalignment – when models diverge from human values, safety requirements, or user intent – poses serious risks. In this position paper, we argue that many alignment failures stem from operational choices in training and deployment. We posit that alignment should shift from static, post-training constraints toward dynamic, participatory approaches that safeguard pluralism, autonomy, and human flourishing. We outline forward-looking directions, including pluralistic evaluation, transparency, and the Flourishing–Justice–Autonomy (FJA) framework, and present a roadmap for advancing alignment research and practice.

LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

M Song, TD Pala, R Zhou, W Jin, A Zadeh, C Li, D Herremans, S Poria

ICLR 2026
Large language models (LLMs) are increasingly integrated into multi-agent systems (MAS), where peer interactions shape individual decisions. While prior work has mainly examined conformity bias, we broaden the view to include how LLMs build rapport from prior interactions, discern and integrate high-quality peer information, and resist misleading inputs, abilities essential for achieving collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark that simulates quiz-style collaboration with peer agents whose rapport levels and behaviours can be precisely controlled in both historical interactions and the current round. This unified setup enables systematic analysis of how rapport, peer actions, and the model's self-confidence jointly influence decision-making. Using KAIROS, we evaluate prompting, supervised fine-tuning, and reinforcement learning via Group Relative Policy Optimisation (GRPO). Results show that model scale is a primary factor moderating susceptibility to social influence: larger models are more resilient and benefit from prompting-based mitigation, whereas smaller models remain vulnerable. Only carefully configured GRPO training yields consistent robustness and performance gains for small models.

OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

J Lei, V Gumma, R Bhardwaj, SM Lim, C Li, A Zadeh, S Poria

ICLR 2026
Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM's ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models, Qwen-3 (235B) with 77.77% and Mistral (24B) with 79.96%, fall far short of reliable operational safety, while GPT models plateau in the 62-73% range, Phi achieves only mid-level scores (48-70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is a core model alignment issue, to suppress these failures, we propose prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

W Han, P Zhou, S Poria, S Yan

arXiv preprint arXiv:2603.04759 2026
The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups (2× over streaming and 3× over encoder-decoder architectures).

Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization

CY Hung, N Majumder, Z Kong, A Mehrish, AA Bagherzadeh, C Li, S Poria

ICLR 2026 65 citations
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
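The core of CLAP-Ranked Preference Optimization is constructing preference pairs without gold answers: generate several audio candidates per prompt, score each by CLAP text-audio similarity, and take the best and worst as the (chosen, rejected) pair. A minimal sketch of that selection step follows; the field names are illustrative, and the real CRPO pipeline iterates this with preference optimization.

```python
def crpo_pair(candidates):
    """Given audio candidates for one prompt, each carrying an assumed
    `clap_score` (CLAP text-audio similarity), return a (chosen, rejected)
    preference pair from the best- and worst-ranked generations."""
    ranked = sorted(candidates, key=lambda c: c["clap_score"], reverse=True)
    return ranked[0], ranked[-1]
```

In the paper this pair construction is repeated over training rounds, so the preference data tracks the improving model rather than being fixed up front.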

Tracking the Evolution of Multimodal Reasoning on Visual Puzzles

V Toh, YK Chia, D Ghosal, S Poria

Logical and Symbolic Reasoning in Language Models @ AAAI 2026 2026
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models and compare them against leading open-source alternatives on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA. Our results reveal that the o-[n] series, particularly later iterations, significantly outperform both the GPT-[n] series and the evaluated open-source models, establishing clear performance tiers. Nonetheless, despite these substantial advancements, our findings highlight that even leading models face persistent challenges in perception and compositional reasoning.

2025

Action-guided prompt tuning for video grounding

J Wang, R Tsao, X Wang, X Wang, F Feng, S Tian, S Poria

Information Fusion 2025
Video grounding aims to locate a moment-of-interest semantically corresponding to a given query. We claim that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. We propose Action-Guided Prompt Tuning (AGPT), which includes a Prompt Exploration module to expand salient verb representations and an auxiliary action temporal prediction task with a temporal rank loss to simulate human perceptual segmentation. AGPT integrates seamlessly into existing models with minimal overhead and improves performance on video grounding benchmarks.

Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning

D Ghosal, VTY Han, CY Ken, S Poria

NAACL 2025
This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that require visual understanding, language understanding, and complex algorithmic reasoning. The dataset covers topics such as boolean logic, combinatorics, graph theory, optimization, and search, ensuring exact solutions are derivable algorithmically. Our investigation reveals that large language models such as GPT-4V and Gemini exhibit limited performance on these puzzles, often near random in a multiple-choice setup, highlighting the challenges of integrating visual, language, and algorithmic knowledge.

Content extraction based on hop distance within a graph model

Y Dong, Y Deng, J Zhang, F Gelli, T Lin, Y Zhuo, H Wang, S Poria

US Patent 12,277,788 2025
A method of categorizing text entries on a document can include determining, for each of a plurality of text bounding boxes in the document, respective text, respective coordinates, and respective input embeddings. The method may further include defining a graph of the plurality of bounding boxes, the graph comprising a plurality of connections among the plurality of bounding boxes, each connection comprising a first and second bounding box and zero or more respective intermediate bounding boxes. The method may further include determining a respective attention value for each connection according to a quantity of intermediate bounding boxes in the connection and, based on the respective attention values and a transformer-based machine learning model applied to the respective input embeddings and respective coordinates, determining output embeddings for each bounding box and, based on the respective output embeddings, generating a bounding box label for each bounding box.
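The patent's attention values depend on the quantity of intermediate bounding boxes in a connection (its hop distance). One plausible reading, sketched below, is a weight that decays with that count; the exponential form and the `decay` parameter are illustrative assumptions, not the claimed method.

```python
import math

def hop_attention_bias(num_intermediate, decay=1.0):
    """Illustrative attention value for a connection between two text
    bounding boxes: weight decays as more intermediate boxes separate
    them, so nearby boxes attend to each other more strongly."""
    return math.exp(-decay * num_intermediate)
```

Such a bias would then scale (or be added to) the transformer attention scores computed from the boxes' input embeddings and coordinates.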

DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

TB Abdur Rakib, A Mehrish, LK Soon, WH Lim, S Poria

AAAI 2025
Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3 turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while markedly improving negotiation outcomes.
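The value-learning loop described above can be sketched with a linear Q-head trained by temporal-difference updates over fixed embeddings; the dimensions, learning rate, and plain-list embeddings are stand-ins (DialogXpert uses fixed BERT embeddings and a compact Q-network, with candidate actions proposed by a frozen LLM).

```python
DIM = 4               # stand-in for a fixed BERT embedding size
W = [0.0] * DIM       # weights of a linear Q-head (illustrative)

def q_value(emb):
    """Q(s, a) for a state-action embedding."""
    return sum(w * x for w, x in zip(W, emb))

def td_update(emb, reward, next_embs, lr=0.1, gamma=0.9):
    """One temporal-difference step toward r + gamma * max_a' Q(s', a'),
    where next_embs are embeddings of the next turn's candidate actions."""
    target = reward + gamma * max((q_value(e) for e in next_embs), default=0.0)
    err = target - q_value(emb)
    for i, x in enumerate(emb):
        W[i] += lr * err * x
```

At each turn the agent would embed the frozen LLM's candidate actions and pick the one with the highest `q_value`.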

DiffPO: Diffusion-styled Preference Optimization for Inference Time Alignment of Large Language Models

R Chen, W Chai, Z Yang, X Zhang, Z Wang, T Quek, JT Zhou, S Poria

ACL 2025
Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By directly performing alignment at sentence level, DiffPO avoids the time latency associated with token-level generation. Designed as a plug-and-play module, DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings, achieving a favorable trade-off between alignment quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.

Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning

Q Sun, P Hong, TD Pala, V Toh, U Tan, D Ghosal, S Poria

ACL 2025
Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, EMMA-X. EMMA-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which can help mitigate hallucination in grounding subtask reasoning generation. Experimental results demonstrate that EMMA-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.

Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

TD Pala, P Sharma, A Zadeh, C Li, S Poria

EMNLP Findings 2025
Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching existing corpora with step-level labels. On PRMBench, PathFinder-PRM achieves state-of-the-art PRMScore while using substantially less data, and improves reward-guided search performance.
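One simple way to combine the two fine-grained signals into a step score, as the abstract describes, is shown below; treating the error types as independent probabilities is an illustrative assumption, not the paper's exact aggregation.

```python
def step_correctness(p_math_error, p_consistency_error):
    """Illustrative hierarchical aggregation: a reasoning step is scored
    correct only to the extent it is free of both a math error and a
    consistency error (independence assumed for the sketch)."""
    return (1.0 - p_math_error) * (1.0 - p_consistency_error)
```

The hierarchical design means the PRM explains *why* a step is wrong before scoring it, which is what the error-aware supervision targets.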

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

J Lei, D Zhang, S Poria

arXiv preprint arXiv:2512.12602 2025
Linear-time attention and State Space Models (SSMs) promise to solve the quadratic-cost bottleneck of long-context language models that employ softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we derive the exact closed-form solution directly. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving linear-time complexity. Through experiments, we show EFLA delivers robust performance in noisy environments and superior downstream benchmark performance.
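The closed form alluded to above rests on a standard identity for rank-1 matrices; the symbols below are illustrative, not taken from the paper.

```latex
% For a rank-1 dynamics matrix A = u v^T with s = v^T u \neq 0,
% powers collapse: A^k = s^{k-1} A for every k >= 1, so the matrix
% exponential sums in closed form:
\[
  e^{tA} \;=\; I + \sum_{k \ge 1} \frac{t^k A^k}{k!}
         \;=\; I + \frac{e^{ts} - 1}{s}\, A .
\]
% Hence the exact solution of x'(t) = A x(t) is a single rank-1 update
% of the identity -- no series truncation, so no accumulated error.
```

This is why an exact, error-free solution can still be computed in linear time: the exponential of a rank-1 matrix never requires a full matrix exponential.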

Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk Appetite?

D Chawla, A Bhutada, DA Do, A Raghunathan, V Sp, C Guo, DW Liew, S Poria

EMNLP 2025
We assess whether AI systems can credibly evaluate investment risk appetite—a task that must be thoroughly validated before automation. Our analysis covers proprietary systems (GPT, Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral), using carefully curated user profiles that reflect real users with varying attributes such as country and gender. We find that the models exhibit significant variance in score distributions when user attributes that should not influence risk computation—such as country or gender—are changed. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles. While some models align closely with expected scores in the low- and mid-risk ranges, none maintain consistent scores across regions and demographics, thereby violating AI and finance regulations.

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

P Hong, N Majumder, D Ghosal, S Aditya, R Mihalcea, S Poria

ACL (Findings) 2025
Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, GSMORE and HUMANEVAL-CORE, of perturbed math and coding problems, respectively, to probe LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that current LLMs lack robust problem-solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open-source the datasets and source codes at: https://github.com/declare-lab/LLM-ReasoningTest.

Ferret: Faster and effective automated red teaming with reward-based scoring technique

TD Pala, VYH Toh, R Bhardwaj, S Poria

EMNLP (Findings) 2025
Automated red-teaming generates adversarial attacks to identify vulnerabilities in LLMs but can be slow and resource-intensive. We propose Ferret, which enhances baseline Rainbow Teaming by producing multiple adversarial prompt mutations per iteration and ranking them with scoring functions (reward models, Llama Guard, LLM-as-a-judge). Ferret achieves higher attack success rates, greater transferability across models, and faster time-to-target, with code available at https://github.com/declare-lab/ferret.

From grounding to manipulation: Case studies of foundation model integration in embodied robotic systems

X Sui, D Tian, Q Sun, R Chen, D Choi, K Kwok, S Poria

EMNLP (Findings) 2025
Foundation models (FMs) are increasingly applied to bridge language and action in embodied agents, yet the operational characteristics of different integration strategies remain under-explored. We investigate three paradigms for robotic systems: end-to-end vision-language-action models (VLAs), modular pipelines using vision-language models (VLMs), and multimodal large language models (MLLMs). Case studies on instruction grounding and object manipulation reveal trade-offs in scale, generalization and data efficiency, providing design lessons for language-driven physical agents and identifying opportunities and challenges for FM-powered robotics in real-world settings.

Harnessing large language models for scientific novelty detection

Y Liu, Z Yang, S Poria, TS Nguyen, E Cambria

arXiv preprint arXiv:2505.24615 2025
In an era of exponential scientific growth, identifying novel research ideas is crucial and challenging in academia. Despite its importance, the lack of an appropriate benchmark dataset hinders research on novelty detection. More importantly, simply adopting existing NLP techniques, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution due to the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), contributing two new datasets in the marketing and NLP domains. To construct suitable datasets for ND, we extract closure sets of papers based on their relationships and then summarize their main ideas with LLMs. To capture idea conception, we train a lightweight retriever by distilling idea-level knowledge from LLMs to align ideas with similar conception, enabling efficient and accurate idea retrieval for LLM-based novelty detection. Experiments show our method consistently outperforms others on the proposed benchmark datasets for both idea retrieval and ND. Codes and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.

JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

R Liu, CY Hung, N Majumder, T Gautreaux, AA Bagherzadeh, C Li, S Poria

arXiv preprint arXiv:2507.20880 2025
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality, faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation, which primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To better align generated songs with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotation. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms existing models on music-specific attributes.

Lessons from Training Grounded LLMs with Verifiable Rewards

SH Sim, TD Pala, V Toh, HL Chieu, A Zadeh, C Li, N Majumder, S Poria

arXiv preprint arXiv:2506.15522 2025
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA, we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.

Libra-leaderboard: Towards responsible AI through a balanced leaderboard of safety and capability

H Li, X Han, Z Zhai, H Mu, H Wang, Z Zhang, Y Geng, S Lin, R Wang, S Poria

NAACL 2025
As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a significant gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of others. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.

M-longdoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework

YK Chia, L Cheng, HP Chan, C Liu, M Song, SM Aljunied, S Poria, L Bing

EMNLP 2025
The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended explanations and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enhance open models, we construct a training corpus in a fully automatic manner. Experiments show that our tuning approach significantly improves the correctness of model responses by 4.6%.

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

M Song, SH Sim, R Bhardwaj, HL Chieu, N Majumder, S Poria

ICLR 2025
LLMs are an integral component of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the overall quality of end-to-end RAG systems, there is a gap in understanding the appropriateness of LLMs for the RAG task. To address this, we introduce Trust-Score, a holistic metric that evaluates the trustworthiness of LLMs within the RAG framework. Our results show that various prompting methods, such as in-context learning, fail to effectively adapt LLMs to the RAG task as measured by Trust-Score. Consequently, we propose Trust-Align, a method to align LLMs for improved Trust-Score performance. 26 out of 27 models aligned using Trust-Align substantially outperform competitive baselines on ASQA, QAMPARI, and ELI5. Specifically, in LLaMA-3-8b, Trust-Align outperforms FRONT on ASQA (up 12.56), QAMPARI (up 36.04), and ELI5 (up 17.69). Trust-Align also significantly enhances models' ability to correctly refuse and provide quality citations. We also demonstrate the effectiveness of Trust-Align across different open-weight models, including the LLaMA series (1b to 8b), Qwen-2.5 series (0.5b to 7b), and Phi3.5 (3.8b). We release our code at https://github.com/declare-lab/trust-align.

MM-InstructEval: Zero-shot evaluation of (Multimodal) Large Language Models on multimodal reasoning tasks

X Yang, W Wu, S Feng, M Wang, D Wang, Y Li, Q Sun, Y Zhang, X Fu, S Poria

Information Fusion 2025 61 citations
The emergence of multimodal large language models (MLLMs) has triggered extensive research in model evaluation. While existing evaluation studies primarily focus on unimodal (vision-only) comprehension and reasoning capabilities, they overlook critical assessments of complex multimodal reasoning tasks that require integrated understanding of both visual and textual contexts. Such multimodal tasks present unique challenges, demanding sophisticated reasoning across multiple modalities and deep comprehension of multimodal contexts. In this paper, we present MM-InstructEval, a comprehensive evaluation framework that incorporates diverse metrics to assess model performance across various multimodal reasoning tasks with vision-text contexts. We conduct extensive zero-shot evaluations on 45 models (including 36 MLLMs) across 16 multimodal datasets, encompassing 6 distinct tasks using 10 different instructions. Our framework introduces multiple innovative metrics, including the 'Best Performance' metric to benchmark peak model capabilities, the 'Mean Relative Gain' metric to assess overall efficacy across models and instructions, the 'Stability' metric to measure robustness, and the 'Adaptability' metric to quantify the compatibility between models and instructions. Through comprehensive evaluation and analysis, we uncover several significant insights about model architectures, instruction formats, and their interactions in multimodal reasoning tasks. Our findings establish new benchmarks for assessing the reasoning capabilities of MLLMs and provide strategic guidance for future developments. To facilitate continued research and evaluation in this field, we release our framework and resources at https://github.com/declare-lab/MM-InstructEval, with an interactive leaderboard available at MM-InstructEval Leaderboard (https://declare-lab.github.io/MM-InstructEval/).

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

Z Yang, W Liu, B Gao, T Xie, Y Li, W Ouyang, S Poria, E Cambria, D Zhou

ICLR 2025 83 citations
We investigate whether LLMs can automatically discover novel and valid chemistry research hypotheses given only a research question. Using a benchmark of 51 chemistry papers, we break the task into retrieving inspirations and generating hypotheses. We develop an LLM-based multi-agent framework and show the proposed method can rediscover many hypotheses with high similarity to ground truth.

Multimodal analysis combining monitoring modalities to elicit cognitive states and perform screening for mental disorders

N Howard, S Poria, N Majumder, S Kanareykin, S Rawlley, T Juarez

US Patent App. 18/887,228 2025
Embodiments provide techniques for mental health screening using multimodal analysis (video, audio, speech). Features are extracted per modality and fused (late fusion) to classify fused features using trained models for detection of disorders such as depression, anxiety, and PTSD. The system can generate a disorder-state representation and be integrated into telehealth platforms for screening and monitoring.

Nora-1.5: A vision-language-action model trained using world model-and action-based preference rewards

CY Hung, N Majumder, H Deng, L Renhang, Y Ang, A Zadeh, C Li, S Poria

arXiv preprint arXiv:2511.14659 2025
Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.

Nora: A small open-sourced generalist vision language action model for embodied tasks

CY Hung, Q Sun, P Hong, A Zadeh, C Li, U Tan, N Majumder, S Poria

arXiv preprint arXiv:2504.19854 2025 62 citations
Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, NORA is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.

Not all votes count! Translated program for verification improves self-consistency of language models for math reasoning

V Toh, D Ghosal, S Poria

AI for Math Workshop @ ICML 2025 2025
Large language models (LLMs) have become increasingly capable of solving mathematical reasoning problems. However, many open-source LLMs still encounter issues with calculation errors and semantic misunderstandings during intermediate reasoning steps. In this work, we present Prove, a simple yet effective framework that leverages translated Python programs derived from natural language solutions as a verification mechanism. This verification mechanism helps identify and filter out potentially incorrect paths before final answers are aggregated. Unlike basic majority voting, our approach rejects solutions whose program outputs do not align with the generated solution, only aggregating those that pass the verification step. We conducted extensive experiments with 13 open-source LLMs of various model sizes across eight mathematical benchmarks, demonstrating consistent improvements over baselines.

Pixel-level reasoning segmentation via multi-turn conversations

D Cai, X Yang, Y Liu, D Wang, S Feng, Y Zhang, S Poria

ACL 2025
Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level or comprehend dynamic user intent that changes over the course of an interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework that integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://anonymous.4open.science/r/PixelRS/.

PREMISE: Matching-based Prediction for Accurate Review Recommendation

W Han, H Chen, S Poria

NAACL Findings 2025
We present PREMISE, a new architecture for matching-based learning in multimodal settings for the MRHP (multimodal review helpfulness prediction) task. Unlike previous fusion-based methods, which obtain multimodal representations via cross-modal attention for downstream tasks, PREMISE computes multi-scale and multi-field representations, filters duplicated semantics, and then obtains a set of matching scores as feature vectors for the downstream recommendation task. This new architecture significantly boosts performance on multimodal tasks whose matching content is highly correlated with the task targets, compared to state-of-the-art fusion-based methods. Experimental results on two publicly available datasets show that PREMISE achieves promising performance at a lower computational cost.

PROEMO: Prompt-driven text-to-speech synthesis based on emotion and intensity control

S Zhang, A Mehrish, Y Li, S Poria

arXiv preprint arXiv:2501.06276 2025
Speech synthesis has significantly advanced from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis is challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.

PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference

W Jin, M Song, TD Pala, YK Chia, A Zadeh, C Li, S Poria

arXiv preprint arXiv:2503.23274 2025
As large language models (LLMs) tackle increasingly complex tasks and longer documents, their computational and memory costs during inference become a major bottleneck. To address this, we propose PromptDistill, a novel, training-free method that improves inference efficiency while preserving generation quality. PromptDistill identifies and retains the most informative tokens by leveraging attention interactions in early layers, preserving their hidden states while reducing the computational burden in later layers. This allows the model to focus on essential contextual information without fully processing all tokens. Unlike previous methods such as H2O and SnapKV, which perform compression only after processing the entire input, or GemFilter, which selects a fixed portion of the initial prompt without considering contextual dependencies, PromptDistill dynamically allocates computational resources to the most relevant tokens while maintaining a global awareness of the input. Experiments using our method and baseline approaches with base models such as LLaMA 3.1 8B Instruct, Phi 3.5 Mini Instruct, and Qwen2 7B Instruct on benchmarks including LongBench, InfBench, and Needle in a Haystack demonstrate that PromptDistill significantly improves efficiency while having minimal impact on output quality compared to the original models. With a single-stage selection strategy, PromptDistill effectively balances performance and efficiency, outperforming prior methods like GemFilter, H2O, and SnapKV due to its superior ability to retain essential information. Specifically, compared to GemFilter, PromptDistill achieves an overall 1% to 5% performance improvement while also offering better time efficiency. Additionally, we explore multi-stage selection, which further improves efficiency while maintaining strong generation performance.

Reward-Guided Tree Search for Inference Time Alignment of Large Language Models

CY Hung, N Majumder, A Mehrish, S Poria

NAACL 2025
Inference-time computation methods enhance the performance of Large Language Models (LLMs) by leveraging additional computational resources to achieve superior results. Common techniques, such as Best-of-N sampling, Majority Voting, and variants of tree-search algorithms, have proven effective in boosting the performance of LLMs. These approaches strategically trade increased computational resources for improved model responses. In this work, we propose DARWIN, an inference-time alignment method that leverages the guidance of a reward model to achieve alignment through reward-guided tree search. Empirical evidence indicates that our method outperforms other inference-time alignment methods, such as Best-of-N and ARGS, on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Furthermore, we show that our inference-time approach achieves performance comparable to preference-tuned models on both benchmarks, highlighting the effectiveness of trading inference-time compute for enhanced performance during inference.

The ACM Multimedia 2025 Grand Challenge of Multimodal Conversational Aspect-based Sentiment Analysis

M Luo, H Fei, B Li, S Wu, Q Liu, S Poria, E Cambria, ML Lee, W Hsu

ACM MM 2025
Understanding fine-grained sentiment dynamics in human conversations is central to next-generation AI, especially when interactions are rich in modalities and context. To advance research, we organized the MCABSA challenge and introduced two subtasks: Panoptic Sentiment Sextuple Extraction and Sentiment Flipping Analysis. To support these tasks, we present the PanoSent dataset, a high-quality benchmark annotated across text, image, audio, and video modalities, and summarize the top systems and findings.

The jumping reasoning curve? Tracking the evolution of reasoning performance in GPT-[n] and o-[n] models on multimodal puzzles

VYH Toh, YK Chia, D Ghosal, S Poria

arXiv preprint arXiv:2502.01081 2025
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA, which demand fine-grained visual perception. Our results reveal that o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperform the GPT-[n] series and show strong scalability in multimodal reasoning. Nonetheless, despite these substantial advancements and the superior capabilities demonstrated by the o-[n] series, our findings highlight that even these leading models face persistent challenges. Difficulties are particularly evident in tasks requiring precise visual perception, robust compositional reasoning across multiple visual attributes, and solving complex algorithmic or highly combinatorial puzzles, indicating critical areas for future AGI development. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.

Toward robust multimodal sentiment analysis using multimodal foundational models

X Zhao, S Poria, X Li, Y Chen, B Tang

Expert Systems with Applications 2025
This paper investigates robust approaches to multimodal sentiment analysis using multimodal foundational models. We analyze model robustness under modality corruption, distributional shifts, and noisy real-world inputs, proposing training and adaptation strategies that substantially improve performance across benchmarks.

Training vision-language process reward models for test-time scaling in multimodal reasoning: Key insights and lessons learned

B Ong, TD Pala, V Toh, WC Tjhi, S Poria

arXiv preprint arXiv:2509.23250 2025
Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models. While PRMs have been extensively studied in text-based domains, their extension to Vision Language Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on Monte Carlo Tree Search (MCTS) for data construction, which can often produce noisy supervision signals and limit generalization across tasks. In this work, we aim to elucidate the design space of VL-PRMs by exploring diverse strategies for dataset construction, training, and test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with judgments from a strong VLM, producing more accurate step-level labels. Second, we propose perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments covering five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision) reveal several key insights: (i) VL-PRMs, when used as Outcome Reward Models (ORMs) during test-time scaling (TTS), can outperform VL-PRM-guided process step selection, (ii) smaller VL-PRMs can match or even surpass larger ones in detecting process errors, (iii) VL-PRMs uncover latent reasoning abilities in stronger VLM backbones, (iv) perception-level supervision leads to significant gains in test-time scaling, and (v) the TTS performance of different policies improves on advanced math reasoning datasets despite VL-PRMs not being trained on such datasets. We hope our work will motivate further research and support the advancement of VLMs.

Why AI is weird and should not be this way: Towards AI for everyone, with everyone, by everyone

R Mihalcea, O Ignat, L Bai, A Borah, L Chiruzzo, Z Jin, C Kwizera, J Nwatu, S Poria

AAAI 2025 56 citations
This paper presents a vision for creating AI systems that are inclusive at every stage of development, from data collection to model design and evaluation. We address key limitations in the current AI pipeline and its WEIRD representation, such as lack of data diversity, biases in model performance, and narrow evaluation metrics. We also focus on the need for diverse representation among the developers of these systems, as well as incentives that are not skewed toward certain groups. We highlight opportunities to develop AI systems that are for everyone (with diverse stakeholders in mind), with everyone (inclusive of diverse data and annotators), and by everyone (designed and developed by a globally diverse workforce).

2024

Can-do! A dataset and neuro-symbolic grounded framework for embodied planning with large multimodal models

YK Chia, Q Sun, L Bing, S Poria

arXiv preprint arXiv:2409.14277 2024
Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at this https URL.

Chain-of-Knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources

X Li, R Zhao, YK Chia, B Ding, S Joty, S Poria, L Bing

ICLR 2024 292 citations
We present chain-of-knowledge (CoK), a novel framework that augments large language models (LLMs) by dynamically incorporating grounding information from heterogeneous sources. It results in more factual rationales and reduced hallucination in generation. Specifically, CoK consists of three stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a knowledge-intensive question, CoK first prepares several preliminary rationales and answers while identifying the relevant knowledge domains. If there is no majority consensus among the answers from samples, CoK corrects the rationales step by step by adapting knowledge from the identified domains. These corrected rationales can plausibly serve as a better foundation for the final answer consolidation. Unlike prior studies that primarily use unstructured data, CoK also leverages structured knowledge sources such as Wikidata and tables that provide more reliable factual information. To access both unstructured and structured knowledge sources in the dynamic knowledge adapting stage, we propose an adaptive query generator that allows the generation of queries for various types of query languages, including SPARQL, SQL, and natural sentences. Moreover, to minimize error propagation between rationales, CoK corrects the rationales progressively, using preceding corrected rationales to generate and correct subsequent rationales. Extensive experiments show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across different domains.

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

X Li, F Bu, A Mehrish, Y Li, J Han, B Cheng, S Poria

NAACL Findings 2024
Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS’s superiority over existing single-step speech synthesis systems, representing a significant advancement in the field.

Consistency Guided Knowledge Retrieval and Denoising in LLMs for Zero-shot Document-level Relation Triplet Extraction

Q Sun, K Huang, X Yang, R Tong, K Zhang, S Poria

WWW 2024 64 citations
Document-level Relation Triplet Extraction (DocRTE) is a fundamental task in information systems that aims to simultaneously extract entities with semantic relations from a document. Existing methods heavily rely on a substantial amount of fully labeled data. However, collecting and annotating data for newly emerging relations is time-consuming and labor-intensive. Recent advanced Large Language Models (LLMs), such as ChatGPT and LLaMA, exhibit impressive long-text generation capabilities, inspiring us to explore an alternative approach for obtaining auto-labeled documents with new relations. In this paper, we propose a Zero-shot Document-level Relation Triplet Extraction (ZeroDocRTE) framework, which Generates labeled data by Retrieval and Denoising Knowledge from LLMs, called GenRDK. Specifically, we propose a chain-of-retrieval prompt to guide ChatGPT to generate labeled long-text data step by step. To improve the quality of synthetic data, we propose a denoising strategy based on the consistency of cross-document knowledge. Leveraging our denoised synthetic data, we proceed to fine-tune the LLaMA2-13B-Chat for extracting document-level relation triplets. We perform experiments for both zero-shot document-level relation and triplet extraction on two public datasets. The experimental results illustrate that our GenRDK framework outperforms strong baselines.

Content extraction based on hop distance within a graph model

Y Dong, Y Deng, J Zhang, F Gelli, T Lin, Y Zhuo, H Wang, S Poria

US Patent App. 17/983,908 2024
A method of categorizing text entries on a document can include determining, for each of a plurality of text bounding boxes in the document, respective text, respective coordinates, and respective input embeddings. The method may further include defining a graph of the plurality of bounding boxes, the graph comprising a plurality of connections among the plurality of bounding boxes, each connection comprising a first and second bounding box and zero or more respective intermediate bounding boxes. The method may further include determining a respective attention value for each connection according to a quantity of intermediate bounding boxes in the connection and, based on the respective attention values and a transformer-based machine learning model applied to the respective input embeddings and respective coordinates, determining output embeddings for each bounding box and, based on the respective output embeddings, generating a bounding box label for each bounding box.
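The hop-distance idea above can be illustrated with a small sketch: connections that pass through more intermediate bounding boxes receive smaller attention values. The BFS shortest-path distances and the geometric decay schedule below are assumptions for illustration only, not the patent's claimed formula:

```python
from collections import deque
import numpy as np

def hop_attention(adj, decay=0.5):
    """Illustrative hop-distance attention values for a bounding-box graph.

    A connection routed through more intermediate boxes (more hops) gets a
    smaller attention value (decay ** hops). The decay schedule is an
    assumption; the patent only specifies that the value depends on the
    quantity of intermediate boxes.
    """
    n = len(adj)
    att = np.zeros((n, n))
    for src in range(n):
        # Breadth-first search for hop counts from src to every reachable box.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            if v != src:
                att[src, v] = decay ** d
    return att

# Chain of boxes 0-1-2: the 0->2 connection has one intermediate box.
adj = {0: [1], 1: [0, 2], 2: [1]}
att = hop_attention(adj)
```

Direct neighbours get weight 0.5 here, while the two-hop connection through an intermediate box is down-weighted to 0.25.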

DELLA-Merging: Reducing interference in model merging through magnitude-based sampling

PT Deep, R Bhardwaj, S Poria

arXiv preprint arXiv:2406.11617 2024 67 citations
With the proliferation of domain-specific models, model merging has emerged as a set of techniques that combine the capabilities of multiple models into one that can multitask without the cost of additional training. In this paper, we propose a new model merging technique, Drop and rEscaLe via sampLing with mAgnitude (DELLA-Merging), that employs a novel pruning technique, MAGPRUNE, which shows significant advantages over DARE and TIES. MAGPRUNE first ranks the parameters in order of their magnitude and assigns higher dropout probabilities (p) to parameters with lower ranks corresponding to lower magnitudes. To approximate the original embeddings, MAGPRUNE employs a rescaling operation on the parameters that survive the random dropping, scaling them by 1/(1 - p). On three different expert models considered for merging (LM, Math, Code) and corresponding benchmark datasets (AlpacaEval, GSM8K, MBPP), DELLA shows an average improvement of 2.4 points over baseline methods employing delta parameter pruning (an improvement of 3.6 points over TIES, 1.2 points over DARE), and 11.1 points over the no-pruning baseline (TA). We release the source code at: this https URL.
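The MAGPRUNE step described above can be sketched in a few lines. Only the rank-based dropping and the 1/(1 - p) rescaling come from the abstract; the linear drop-probability ramp (`p_min`, `p_max`) and the toy delta vector are illustrative assumptions:

```python
import numpy as np

def magprune(delta, p_min=0.1, p_max=0.9, seed=0):
    """Illustrative MAGPRUNE-style pruning of a delta-parameter vector.

    Parameters are ranked by magnitude; lower-magnitude parameters get
    higher drop probabilities (here a linear ramp from p_min to p_max,
    an assumption). Survivors are rescaled by 1/(1 - p), which keeps the
    delta unbiased in expectation: E[w / (1 - p)] * (1 - p) = w.
    """
    rng = np.random.default_rng(seed)
    n = delta.size
    # Double argsort yields ranks; rank 0 = largest magnitude.
    ranks = np.argsort(np.argsort(-np.abs(delta)))
    p = p_min + (p_max - p_min) * ranks / max(n - 1, 1)
    keep = rng.random(n) >= p
    pruned = np.where(keep, delta / (1.0 - p), 0.0)
    return pruned, keep

delta = np.array([0.5, -0.01, 0.2, 0.003, -0.4])
pruned, keep = magprune(delta)
```

Dropped parameters become exactly zero, survivors keep their sign but are scaled up, and the largest-magnitude entries are the most likely to survive.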

Domain-expanded ASTE: Rethinking generalization in aspect sentiment triplet extraction

YK Chia, H Chen, G Chen, W Han, SM Aljunied, S Poria, L Bing

Workshop on Social Influence in Conversations 2024
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments. However, existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains, raising concerns about the generalization of proposed methods. Furthermore, it remains unclear if large language models (LLMs) can effectively handle complex sentiment tasks like ASTE. In this work, we address the issue of generalization in ASTE from both a benchmarking and modeling perspective. We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings. Additionally, we propose CASE, a simple and effective decoding strategy that enhances trustworthiness and performance of LLMs in ASTE. Through comprehensive experiments involving multiple tasks, settings, and models, we demonstrate that CASE can serve as a general decoding strategy for complex sentiment tasks. By expanding the scope of evaluation and providing a more reliable decoding strategy, we aim to inspire the research community to reevaluate the generalizability of benchmarks and models for ASTE. Our code, data, and models are available at https://github.com/DAMO-NLP-SG/domain-expanded-aste.

Hate speech detection: A comprehensive review of recent works

A Gandhi, P Ahir, K Adhvaryu, P Shah, R Lohiya, E Cambria, S Poria, A Hussain

Expert Systems 2024 88 citations
This review surveys recent advances in hate speech detection, covering datasets, annotation practices, model architectures, and evaluation metrics. We discuss challenges including implicit bias, multilinguality, and context-dependence, and outline directions for robust, fair, and explainable hate speech detection systems.

HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Y Li, R Bhardwaj, A Mehrish, B Cheng, S Poria

COLING 2024
Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While developing TTS architectures that

Improving text-to-audio models with synthetic captions

Z Kong, S Lee, D Ghosal, N Majumder, A Mehrish, R Valle, S Poria, B Catanzaro

arXiv preprint arXiv:2406.15487 2024
It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new state-of-the-art.

INSTRAUG: Automatic Instruction Augmentation for Multimodal Instruction Fine-tuning

W Han, H Chen, S Poria

arXiv 2024
Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent works on high-quality instruction-following data generation and selection require large amounts of human labor to conceive model-understandable instructions for the given tasks and to carefully filter the LLM-generated data. In this work, we introduce an automatic instruction augmentation method named INSTRAUG for multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, which is even equivalent to the benefits of scaling up the training data multiple times.

InstructEval: Towards holistic evaluation of instruction-tuned large language models

YK Chia, P Hong, L Bing, S Poria

Workshop on the Scaling Behavior of Large Language Models 2024 160 citations
Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. However, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and lack of holistic evaluation. To address these challenges, we present InstructEval, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is a crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment.

Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

R Bhardwaj, DD Anh, S Poria

ACL 2024 107 citations
We propose RESTA to perform LLM realignment towards safety, which gets compromised due to downstream task fine-tuning. RESTA stands for REstoring Safety through Task Arithmetic. At its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. We demonstrate the effectiveness of RESTA in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math. We also showcase the generalizability of RESTA on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. Overall, RESTA decreases the harmfulness of the compromised model from 18.6% to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while maintaining most of the model’s performance on the task. We release the source codes at: https://github.com/declare-lab/resta.
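At its core, the RESTA re-alignment above is a simple weight-space addition, which can be sketched as follows. How the safety vector is constructed is described in the paper; here it is assumed to be precomputed, and the `scale` knob is a hypothetical extra, not part of the method as stated in the abstract:

```python
import numpy as np

def resta_merge(theta_task, safety_vector, scale=1.0):
    """Illustrative RESTA-style re-alignment: add a safety vector to the
    weights of a fine-tuned (and possibly safety-compromised) model,
    parameter tensor by parameter tensor.

    `safety_vector` is assumed to be precomputed (see the paper for its
    construction); `scale` is an illustrative knob, not from the paper.
    """
    return {name: w + scale * safety_vector[name]
            for name, w in theta_task.items()}

# Toy two-tensor "model" and safety vector.
theta = {"layer.0": np.array([1.0, -2.0]), "layer.1": np.array([0.5])}
sv = {"layer.0": np.array([0.1, 0.1]), "layer.1": np.array([-0.5])}
merged = resta_merge(theta, sv)
```

Because the operation is pure arithmetic on the weights, it requires no gradient updates or additional training data, which is what makes this family of task-arithmetic methods cheap to apply.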

Large language models for automated open-domain scientific hypotheses discovery

Z Yang, X Du, J Li, J Zheng, S Poria, E Cambria

ACL 2024 132 citations
Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction is under a constrained setting: (1) the observation annotations in the dataset are carefully manually handpicked sentences (resulting in a closed-domain setting); and (2) the ground truth hypotheses are mostly commonsense knowledge, making the task less challenging. In this work, we tackle these problems by proposing the first dataset for social science academic hypotheses discovery, with the final goal to create systems that automatically generate valid, novel, and helpful scientific hypotheses, given only a pile of raw web corpus. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance, which exhibits superior performance in terms of both GPT-4-based and expert-based evaluation. To the best of our knowledge, this is the first work showing that LLMs are able to generate novel ("not existing in literature") and valid ("reflecting reality") scientific hypotheses.

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Y Li, A Mehrish, B Chew, B Cheng, S Poria

arXiv preprint arXiv:2406.17257 2024
Different languages have distinct phonetic systems and vary in their prosodic features, making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, a TTS architecture needs to be expressive enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to build a transformer-based model such as SpeechT5 and train it on a large multilingual dataset. As the size of these models grows, conventional fine-tuning for adapting them becomes impractical due to heavy computational cost. In this paper, we propose integrating parameter-efficient transfer learning (PETL) methods, such as adapters and hypernetworks, with TTS architectures for multilingual speech synthesis. Notably, in our experiments, PETL methods are able to achieve performance comparable to or even better than full fine-tuning with only ~2.5% tunable parameters. Code and samples are available at: this https URL.

Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents

Y Dong, L Deng, J Zhang, X Yu, T Lin, F Gelli, S Poria, WS Lee

EACL 2024
Documents that consist of diverse templates and exhibit complex spatial structures pose a challenge for document entity classification. We propose KNN-Former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We limit entities’ attention only to their local radius defined by the KNN graph. We also use combinatorial matching to address the one-to-one mapping property that exists in many documents, where one field has only one corresponding entity. Moreover, our method is highly parameter-efficient compared to existing approaches in terms of the number of trainable parameters. Despite this, experiments across various datasets show our method outperforms baselines in most entity types. Many real-world documents exhibit combinatorial properties which can be leveraged as inductive biases to improve extraction accuracy, but existing datasets do not cover these documents. To facilitate future research into these types of documents, we release a new ID document dataset that covers diverse templates and languages. We also release enhanced annotations for an existing dataset.
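The local-radius attention restriction described above can be illustrated with a small masking sketch: each entity attends only to itself and its k nearest neighbours by distance between box centres. The real model also adds a learned spatial bias; this sketch, with made-up coordinates, shows only the masking:

```python
import numpy as np

def knn_attention_mask(coords, k=2):
    """Illustrative KNN attention mask in the spirit of KNN-Former.

    Each entity may attend only to itself and its k nearest neighbours
    by Euclidean distance over (x, y) box centres. Returns a boolean
    matrix where mask[i, j] = True means i may attend to j.
    """
    n = len(coords)
    # Pairwise Euclidean distances between box centres.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        nearest = np.argsort(d[i])[: k + 1]  # includes self (distance 0)
        mask[i, nearest] = True
    return mask

# Four toy box centres; the fourth sits right next to the first.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.1, 0.1]])
mask = knn_attention_mask(coords, k=1)
```

With k=1 each row allows exactly two positions (self plus nearest neighbour), so the far-away third box is masked out for the first one.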

PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis

M Luo, H Fei, B Li, S Wu, Q Liu, S Poria, E Cambria, ML Lee, W Hsu

ACM MM 2024
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our codes and data are open at https://PanoSent.github.io/.

PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

YK Chia, VTY Han, D Ghosal, L Bing, S Poria

ACL 2024 67 citations
Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of 2000 puzzle instances based on abstract patterns. With this dataset, we evaluate large multimodal models with abstract patterns based on fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they are not able to generalize well to simple abstract patterns. Notably, GPT-4V achieves a score of 46.4% on single-concept puzzles, which shows that state-of-the-art models struggle on our dataset. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future.

Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths

YK Chia, G Chen, W Xu, LA Tuan, S Poria, L Bing

EMNLP 2024
Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through step-by-step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model’s overall problem-solving performance. Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at https://reasoning-paths.github.io.

Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

CY Hung, N Majumder, A Mehrish, S Poria

arXiv 2024
Inference-time computation methods enhance the performance of Large Language Models (LLMs) by leveraging additional computational resources to achieve superior results. Common techniques, such as Best-of-N sampling, Majority Voting, and variants of tree-search algorithms, have proven effective in boosting the performance of LLMs. These approaches strategically trade increased computational resources for improved model responses. In this work, we propose DARWIN, an inference-time alignment method that leverages the guidance of a reward model to achieve alignment through a reward-guided tree search. Empirical evidence indicates that our method outperforms other inference-time alignment methods such as Best-of-N and ARGS on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Furthermore, we show that our inference-time approach achieves performance comparable to preference-tuned models on both benchmarks, highlighting the effectiveness of trading inference-time compute for enhanced performance during inference. We have released our codes at this https URL.
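For reference, the Best-of-N baseline that DARWIN is compared against fits in a few lines: sample N candidates and keep the one the reward model scores highest. The toy generator and length-based reward below are placeholders standing in for a real policy and reward model:

```python
import random

def best_of_n(generate, reward, prompt, n=8, seed=0):
    """Minimal Best-of-N sketch: draw n candidate responses and return
    the one with the highest reward-model score. `generate` and `reward`
    are caller-supplied callables (placeholders for real models)."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy demo: the "policy" appends a random-length suffix and the
# "reward model" simply scores response length.
toy_generate = lambda prompt, rng: prompt + " " + "x" * rng.randint(1, 5)
best = best_of_n(toy_generate, len, "hello", n=4)
```

Reward-guided tree search spends the same extra compute more selectively, expanding only promising partial responses instead of scoring N complete ones.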

Ruby teaming: Improving quality diversity search with memory for automated red teaming

VTY Han, R Bhardwaj, S Poria

arXiv preprint arXiv:2406.11654 2024
We propose Ruby Teaming, a method that improves on Rainbow Teaming by including a memory cache as its third dimension. The memory dimension provides cues to the mutator to yield better-quality prompts, both in terms of attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20% higher than the baseline. In terms of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
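The two diversity metrics reported above are standard and can be computed directly from category counts. The 1 - sum(p_i^2) form of Simpson's index used below is one common variant; the paper may use another:

```python
import math
from collections import Counter

def shannon_evenness(labels):
    """Shannon's Evenness Index: Shannon entropy normalised by ln(S),
    where S is the number of distinct categories (1.0 = perfectly even)."""
    counts = Counter(labels)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    s = len(counts)
    return h / math.log(s) if s > 1 else 0.0

def simpson_diversity(labels):
    """Simpson's Diversity Index in the common 1 - sum(p_i^2) form:
    the probability that two random draws fall in different categories."""
    counts = Counter(labels)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

A perfectly even archive of four categories scores 1.0 on evenness and 0.75 on Simpson diversity, while a heavily skewed one scores far lower on both.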

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

R Hazra, S Layek, S Banerjee, S Poria

EMNLP 2024
Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, making models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across different scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.

Self-adaptive sampling for accurate video question answering on image text models

W Han, H Chen, MY Kan, S Poria

NAACL Findings 2024
Image–text models (ITMs) are the prevalent architecture for video question answering, requiring only a few input frames and thus saving huge computational cost compared to video–language models. However, we find existing ITM video question-answering solutions either 1) adopt simplistic and unintentional sampling strategies, which may miss key frames that offer answer clues; or 2) sample a large number of frames into divided groups, which computational resources cannot accommodate. In this work, we aim at an efficient sampling method for few-frame situations. We first summarize a family of prior sampling methods based on question–frame correlation into a unified one, dubbed *Most Implied Frames* (MIF). Through preliminary results and analysis, we form the hypothesis that question-aware sampling is not necessary, from which we further propose another method, *Most Dominant Frames* (MDF). Experimental results on four public datasets and three advanced ITMs demonstrate that our proposed strategies can boost the performance of image–text pretrained models and apply widely across model architectures and dataset types. Our code is available at https://github.com/declare-lab/Sealing.

Sowing the wind, reaping the whirlwind: The impact of editing language models

R Hazra, S Layek, S Banerjee, S Poria

ACL 2024
In the rapidly advancing field of artificial intelligence, the concept of ‘Red-Teaming’ or ‘Jailbreaking’ large language models (LLMs) has emerged as a crucial area of study. This approach is especially significant for assessing and enhancing the safety and robustness of these models. This paper investigates the intricate consequences of such modifications through model editing, uncovering a complex relationship between enhancing model accuracy and preserving its ethical integrity. Our in-depth analysis reveals a striking paradox: while injecting accurate information is crucial for model reliability, it can paradoxically destabilize the model’s foundational framework, resulting in unpredictable and potentially unsafe behaviors. Additionally, we propose a benchmark dataset, NicheHazardQA, to investigate this unsafe behavior both within the same and across topical domains. This aspect of our research sheds light on how the edits impact the model’s safety metrics and guardrails. Our findings show that model editing serves as a cost-effective tool for topical red-teaming by methodically applying targeted edits and evaluating the resultant model behavior.

Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

N Majumder, CY Hung, D Ghosal, WN Hsu, R Mihalcea, S Poria

ACM MM 2024 168 citations
Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.

Towards robust instruction tuning on multimodal large language models

W Han, H Chen, S Poria

arXiv preprint arXiv:2402.14492 2024
Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent works on high-quality instruction-following data generation and selection require large amounts of human labor to conceive model-understandable instructions for the given tasks and to carefully filter the LLM-generated data. In this work, we introduce an automatic instruction augmentation method named INSTRAUG for multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, which is even equivalent to the benefits of scaling up the training data multiple times.

Two are better than one: Context window extension with multi-grained self-injection

W Han, P Zhou, S Poria, S Yan

arXiv preprint arXiv:2410.19318 2024
The limited context window of contemporary large language models (LLMs) remains a huge barrier to their broader application. We propose SharedLLM, a novel approach based on multi-grained context compression and query-aware information retrieval. SharedLLM composes two stacked short-context LLMs: a lower model acting as a compressor and an upper model as a decoder. The lower model compresses long inputs into compact, multi-grained representations forwarded to the upper model, enabling efficient context-aware processing. A specialized tree-style data structure and retrieval algorithm enable rapid encoding and lookup of multi-grained contextual information. This approach reduces memory footprint and yields inference speedups while generalizing to very long inputs.

Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense

S Shen, L Logeswaran, M Lee, H Lee, S Poria, R Mihalcea

NAACL (Social Impact Award) 2024 87 citations
Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant discrepancy in performance when tested on culture-specific commonsense knowledge for different cultures; (2) LLMs’ general commonsense capability is affected by cultural context; and (3) the language used to query the LLMs can impact their performance on culture-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally aware language models.

Understanding, Leveraging, and Improving Large Language Models

S Poria

Companion Proceedings of the ACM Web Conference 2024
The emergence of Large Language Models (LLMs) has marked a substantial advancement in Natural Language Processing (NLP), contributing significantly to enhanced task performance both within and outside specific domains. However, amidst these achievements, three key questions remain unanswered: 1) The mechanism through which LLMs accomplish their tasks and their limitations, 2) Effectively harnessing the power of LLMs across diverse domains, and 3) Strategies for enhancing the performance of LLMs. This talk aims to delve into our research group's endeavors to address these pivotal questions. Firstly, I will outline our approach, which involves utilizing ontology-guided prompt perturbations to unravel the primary limitations of LLMs in solving mathematical problems. Moving on to the second question, we will explore the utilization of synthetic data generated by LLMs to bolster challenging downstream tasks, particularly focusing on structured prediction where LLMs face persistent challenges. I will elaborate on our initiatives aimed at improving LLMs by incorporating highly effective retrieval strategies, specifically addressing the prevalent challenge of hallucinations that often plagues contemporary LLMs. Finally, I will present a technique on LLM realignment to restore safety lost during fine-tuning.

Video2music: Suitable music generation from videos using an affective multimodal transformer model

J Kang, S Poria, D Herremans

Expert Systems with Applications 2024 101 citations
We develop Video2Music, a generative music framework that matches provided videos by extracting semantic, scene, motion, and emotion features from curated music videos. Audio is transcribed into MIDI/chords and features like note density and loudness are extracted; an Affective Multimodal Transformer (AMT) is trained on the MuVi-Sync dataset to generate music conditioned on video features with an affective-similarity mechanism. A post-processing biGRU-based regressor estimates note density and loudness to render dynamic output. User studies confirm that generated music matches video emotion and quality.

Walledeval: A comprehensive safety evaluation toolkit for large language models

P Gupta, LQ Yau, HH Low, IS Lee, HM Lim, YX Teoh, KJ Hng, DW Liew, R Bhardwaj, S Poria

EMNLP 2024
WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at https://github.com/walledai/walledeval with a demonstration video at https://youtu.be/50Zy97kj1MA.

2023

A Comprehensive Survey of Sentence Representations: From the BERT Epoch to the ChatGPT Era and Beyond

AR Kashyap, TT Nguyen, V Schlegel, S Winkler, SK Ng, S Poria

EACL 2023
Sentence representations are a critical component in NLP applications such as retrieval, question answering, and text classification. They capture the meaning of a sentence, enabling machines to understand and reason over human language. In recent years, significant progress has been made in developing methods for learning sentence representations, including unsupervised, supervised, and transfer learning approaches. However, no comprehensive literature review of sentence representation methods has been published to date. In this paper, we provide an overview of the different methods for sentence representation learning, focusing mostly on deep learning models. We provide a systematic organization of the literature, highlighting the key contributions and challenges in this area. Overall, our review highlights the importance of this area in natural language processing, the progress made in sentence representation learning, and the challenges that remain. We conclude with directions for future research, suggesting potential avenues for improving the quality and efficiency of sentence representations.

A Review of Deep Learning Techniques for Speech Processing

A Mehrish, N Majumder, R Bhardwaj, R Mihalcea, S Poria

Information Fusion 2023 514 citations
The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.

A Robust Information-Masking Approach for Domain Counterfactual Generation

P Hong, R Bhardwaj, N Majumder, S Aditya, S Poria

ACL Findings 2023
Domain shift is a big challenge in NLP. Many approaches, thus, resort to learning domain-invariant features to mitigate the hurdles of domain shift during inference. Such methods, however, inexorably fail to leverage the domain-specific nuances relevant to the task at hand. To avoid such drawbacks, domain counterfactual generation has recently been proposed that aims to transform a text from the source domain to a given target domain. To achieve this, the existing method uses a frequency-based approach to identify and mask the source-domain-specific tokens in a text. A pretrained LM is then prompted to fill the masks with target-domain-specific tokens. We, however, have observed that, due to limitations of the available data, such a frequency-based method may either miss some domain-token associations or lead to some spurious domain-token associations. To this end, we additionally employ attention norm-based scores to identify additional token-domain associations from a domain classifier. To minimize spurious associations, we also devise an iterative unmasking heuristic that unmasks the masked tokens to minimize the confidence of a domain classifier in the source domain. Our experiments empirically show that the counterfactual samples sourced from our masked text lead to improved domain transfer across various classification tasks. The proposed approach outperforms the baselines on 10 out of 12 domain-counterfactual classification settings with an average of 1.7% improvement in accuracy metric.

Adapter Pruning using Tropical Characterization

R Bhardwaj, T Vaidya, S Poria

EMNLP Findings 2023
Adapters are widely popular parameter-efficient transfer learning approaches in natural language processing that insert trainable modules in between layers of a pre-trained language model. Apart from several heuristics, however, there has been a lack of studies analyzing the optimal number of adapter parameters needed for downstream applications. Thus, we propose an adapter pruning approach by studying the tropical characteristics of trainable modules. We cast it as an optimization problem that aims to prune parameters from the adapter layers without changing the orientation of underlying tropical hypersurfaces. Our experiments on five NLP datasets show that tropical geometry tends to identify more relevant parameters to prune when compared with the magnitude-based baseline, while a combined approach works best across the tasks.

ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

A Mehrish, AR Kashyap, L Yingting, N Majumder, S Poria

Interspeech 2023
There are significant challenges for speaker adaptation in text-to-speech for languages that are not widely spoken or for speakers with accents or dialects that are not well-represented in the training data. To address this issue, we propose the use of the "mixture of adapters" method. This approach involves adding multiple adapters within a backbone-model layer to learn the unique characteristics of different speakers. Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests when using only one minute of data for each new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters (11% of the total model parameters). This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind. Overall, our proposed approach offers a promising solution for speech synthesis, particularly for adapting to speakers from diverse backgrounds.

Contrastive chain-of-thought prompting

YK Chia, G Chen, LA Tuan, S Poria, L Bing

arXiv preprint arXiv:2311.09277 2023 106 citations
Despite the success of chain of thought in enhancing language model reasoning, the underlying process remains less well understood. Although logically sound reasoning appears inherently crucial for chain of thought, prior studies surprisingly reveal minimal impact when using invalid demonstrations instead. Furthermore, the conventional chain of thought does not inform language models on what mistakes to avoid, which potentially leads to more errors. Hence, inspired by how humans can learn from both positive and negative examples, we propose contrastive chain of thought to enhance language model reasoning. Compared to the conventional chain of thought, our approach provides both valid and invalid reasoning demonstrations, to guide the model to reason step-by-step while reducing reasoning mistakes. To improve generalization, we introduce an automatic method to construct contrastive demonstrations. Our experiments on reasoning benchmarks demonstrate that contrastive chain of thought can serve as a general enhancement of chain-of-thought prompting.
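The core idea, pairing a valid demonstration with an invalid one in the same prompt, can be sketched as plain prompt assembly. The demonstration text and function name below are illustrative only, not taken from the paper:

```python
# Minimal sketch of a contrastive chain-of-thought prompt: each demonstration
# pairs a correct reasoning chain with an incorrect one, so the model sees both
# what to do and what to avoid before answering the new question.

def build_contrastive_prompt(question: str) -> str:
    demo = (
        "Q: If a pen costs $2 and a notebook costs $3, what do 2 pens and 1 notebook cost?\n"
        "Correct explanation: 2 pens cost 2 * $2 = $4; adding one notebook, $4 + $3 = $7. Answer: 7\n"
        "Wrong explanation: 2 pens cost $2 + $3 = $5; adding one notebook, $5 + $2 = $7. Answer: 7\n"
    )
    # End with an open "Correct explanation:" cue so the model continues with
    # valid step-by-step reasoning for the new question.
    return demo + f"Q: {question}\nCorrect explanation:"

print(build_contrastive_prompt("What do 3 notebooks cost?"))
```

The paper also describes an automatic method for constructing such contrastive demonstrations; here they are simply hard-coded for brevity.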

Dialogue relation extraction with document-level heterogeneous graph attention networks

H Chen, P Hong, W Han, N Majumder, S Poria

Cognitive Computation 2023 72 citations
Dialogue relation extraction aims to detect the relation between pairs of entities mentioned in a multi-party dialogue. It plays an essential role in understanding the deep logic of dialogues and facilitating the development of intelligent dialogue systems. We propose an attention-based heterogeneous graph network that models cross-sentence relations in a conversation in an inductive manner, capturing multi-type features such as utterance, word, speaker, argument, and entity type information. Compared with several popular sequence-based and graph-based baselines, such as convolutional neural networks and long short-term memory networks, our model outperforms the state-of-the-art method by 9.4%/7.8% F1 and 6.6%/3.9% F1_c on the validation and test sets of the benchmark dataset DialogRE, with only 4.0M parameters. The implementation of this work can be found at https://github.com/declare-lab/dialog-HGAT.

Evaluating parameter-efficient transfer learning approaches on SURE benchmark for speech understanding

Y Li, A Mehrish, R Bhardwaj, N Majumder, B Cheng, S Zhao, A Zadeh, R Mihalcea, S Poria

ICASSP 2023
Fine-tuning is widely used as the default algorithm for transfer learning from pre-trained models. Parameter inefficiency can however arise when, during transfer learning, all the parameters of a large pre-trained model need to be updated for individual downstream tasks. As the number of parameters grows, fine-tuning is prone to overfitting and catastrophic forgetting. In addition, full fine-tuning can become prohibitively expensive when the model is used for many tasks. To mitigate this issue, parameter-efficient transfer learning algorithms, such as adapters and prefix tuning, have been proposed as a way to introduce a few trainable parameters that can be plugged into large pre-trained language models such as BERT and HuBERT. In this paper, we introduce the Speech UndeRstanding Evaluation (SURE) benchmark for parameter-efficient learning for various speech processing tasks. Additionally, we introduce a new adapter, ConvAdapter, based on 1D convolution. We show that ConvAdapter outperforms the standard adapters while showing comparable performance against prefix tuning and Low-Rank Adaptation with only 0.94% of trainable parameters.

Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt

X Yang, S Feng, D Wang, Q Sun, W Wu, Y Zhang, P Hong, S Poria

ACL Findings 2023
We have witnessed the rapid proliferation of multimodal data on numerous social media platforms. Conventional studies typically require massive labeled data to train models for Multimodal Aspect-Based Sentiment Analysis (MABSA). However, collecting and annotating fine-grained multimodal data for MABSA is tough. To alleviate the above issue, we perform three MABSA-related tasks with quite a small number of labeled multimodal samples. We first build diverse and comprehensive multimodal few-shot datasets according to the data distribution. To capture the specific prompt for each aspect term in a few-shot scenario, we propose a novel Generative Multimodal Prompt (GMP) model for MABSA, which includes the Multimodal Encoder module and the N-Stream Decoders module. We further introduce a subtask to predict the number of aspect terms in each instance to construct the multimodal prompt. Extensive experiments on two datasets demonstrate that our approach outperforms strong baselines on two MABSA-related tasks in the few-shot setting.

Few-shot multimodal sentiment analysis based on multimodal probabilistic fusion prompts

X Yang, S Feng, D Wang, Y Zhang, S Poria

ACM MM 2023 55 citations
Multimodal sentiment analysis has gained significant attention due to the proliferation of multimodal content on social media. However, existing studies in this area rely heavily on large-scale supervised data, which is time-consuming and labor-intensive to collect. Thus, there is a need to address the challenge of few-shot multimodal sentiment analysis. To tackle this problem, we propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario. Specifically, we start by introducing a Consistently Distributed Sampling approach called CDS, which ensures that the few-shot dataset has the same category distribution as the full dataset. Unlike previous approaches primarily using prompts based on the text modality, we design unified multimodal prompts to reduce discrepancies between different modalities and dynamically incorporate multimodal demonstrations into the context of each multimodal instance. To enhance the model's robustness, we introduce a probabilistic fusion method to fuse output predictions from multiple diverse prompts for each input. Our extensive experiments on six datasets demonstrate the effectiveness of our approach. First, our method outperforms strong baselines in the multimodal few-shot setting. Furthermore, under the same amount of data (1% of the full dataset), our CDS-based experimental results significantly outperform those based on previously sampled datasets constructed from the same number of instances of each class.

Flacuna: Unleashing the problem solving power of vicuna using flan fine-tuning

D Ghosal, YK Chia, N Majumder, S Poria

arXiv preprint arXiv:2307.02053 2023
Recently, the release of INSTRUCTEVAL has provided valuable insights into the performance of large language models (LLMs) that utilize encoder-decoder or decoder-only architecture. Interestingly, despite being introduced four years ago, T5-based LLMs, such as FLAN-T5, continue to outperform the latest decoder-based LLMs, such as LLAMA and VICUNA, on tasks that require general problem-solving skills. This performance discrepancy can be attributed to three key factors: (1) Pre-training data, (2) Backbone architecture, and (3) Instruction dataset. In this technical report, our main focus is on investigating the impact of the third factor by leveraging VICUNA, a large language model based on LLAMA, which has undergone fine-tuning on ChatGPT conversations. To achieve this objective, we fine-tuned VICUNA using a customized instruction dataset collection called FLANMINI. This collection includes a subset of the large-scale instruction dataset known as FLAN, as well as various code-related datasets and conversational datasets derived from ChatGPT/GPT-4. This dataset comprises a large number of tasks that demand problem-solving skills. Our experimental findings strongly indicate that the enhanced problem-solving abilities of our model, FLACUNA, are obtained through fine-tuning VICUNA on the FLAN dataset, leading to significant improvements across numerous benchmark datasets in INSTRUCTEVAL. FLACUNA is publicly available at this https URL.

kNN-CM: A Non-parametric Inference-Phase Adaptation of Parametric Text Classifiers

R Bhardwaj, Y Li, N Majumder, B Cheng, S Poria

EMNLP Findings 2023
Semi-parametric models exhibit the properties of both parametric and non-parametric modeling and have been shown to be effective in the next-word prediction language modeling task. However, there is a lack of studies on the text-discriminating properties of such models. We propose an inference-phase approach— k -Nearest Neighbor Classification Model ( k NN-CM)—that enhances the capacity of a pre-trained parametric text classifier by incorporating a simple neighborhood search through the representation space of (memorized) training samples. The final class prediction of k NN-CM is based on the convex combination of probabilities obtained from k NN search and prediction of the classifier. Our experiments show consistent performance improvements on eight SuperGLUE tasks, three adversarial natural language inference (ANLI) datasets, 11 question-answering (QA) datasets, and two sentiment classification datasets.
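The convex combination at the heart of kNN-CM can be sketched in a few lines. This is an illustrative toy, not the paper's implementation; the function names, distance metric, and data are assumptions:

```python
import numpy as np

# Sketch of the kNN-CM idea: blend a parametric classifier's class probabilities
# with a distribution derived from a k-nearest-neighbour search over memorized
# training representations, via a convex combination.

def knn_probs(query, train_reprs, train_labels, n_classes, k=3):
    # Euclidean distances from the query to all stored training representations
    dists = np.linalg.norm(train_reprs - query, axis=1)
    nearest = np.argsort(dists)[:k]
    probs = np.zeros(n_classes)
    for idx in nearest:
        probs[train_labels[idx]] += 1.0 / k  # uniform weight over neighbours
    return probs

def knn_cm(classifier_probs, query, train_reprs, train_labels, lam=0.5):
    # Convex combination: lam * classifier distribution + (1 - lam) * kNN distribution
    p_knn = knn_probs(query, train_reprs, train_labels, len(classifier_probs))
    return lam * np.asarray(classifier_probs) + (1 - lam) * p_knn

train_reprs = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
train_labels = np.array([0, 0, 1])
# The classifier alone favours class 1, but the query's neighbourhood is class 0.
mixed = knn_cm([0.4, 0.6], np.array([0.05, 0.0]), train_reprs, train_labels, lam=0.5)
print(mixed.argmax())  # 0: neighbourhood evidence flips the prediction
```

Since both inputs are probability distributions, the convex combination is itself a valid distribution, so no renormalization is needed at inference time.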

Language guided visual question answering: Elevate your multimodal language model using knowledge-enriched prompts

D Ghosal, N Majumder, RKW Lee, R Mihalcea, S Poria

EMNLP 2023
Visual question answering (VQA) is the task of answering questions about an image. The task assumes an understanding of both the image and the question to provide a natural language answer. VQA has gained popularity in recent years due to its potential applications in a wide range of fields, including robotics, education, and healthcare. In this paper, we focus on knowledge-augmented VQA, where answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image. We propose a multimodal framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately. We benchmark our method on the multi-choice question-answering task of the A-OKVQA, Science-QA, VSR, and IconQA datasets using CLIP and BLIP models. We show that the use of language guidance is a simple but powerful and effective strategy for visual question answering. Our language guidance improves the performance of CLIP by 7.6% and BLIP-2 by 4.8% in the challenging A-OKVQA dataset. We also observe consistent improvement in performance on the Science-QA, VSR, and IconQA datasets when using the proposed language guidances. The implementation of LG-VQA is publicly available at https://github.com/declare-lab/LG-VQA.

Language model unalignment: Parametric red-teaming to expose hidden harms and biases

R Bhardwaj, S Poria

arXiv preprint arXiv:2310.14303 2023
Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate and applicability to specific models. In this paper, we present a new perspective on LLM safety research, i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior. Unalignment using as few as 100 examples can significantly bypass the safety guardrails of widely used models such as CHATGPT, to the point where the model responds with an 88% success rate to harmful queries on two safety benchmark datasets. On open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B and 13B, it shows an attack success rate of more than 91%. On bias evaluations, Unalignment exposes inherent biases in safety-aligned models such as CHATGPT and LLAMA-2-CHAT, where the model's responses are strongly biased and opinionated 64% of the time.

LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models

Z Hu, Y Lan, L Wang, W Xu, EP Lim, RKW Lee, L Bing, S Poria

EMNLP 2023 425 citations
The success of large language models (LLMs), like GPT-4 and ChatGPT, has led to the development of numerous cost-effective and accessible alternatives that are created by finetuning open-access LLMs with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the most attractive topics, as it only requires fine-tuning a few external parameters instead of the entire LLMs while achieving comparable or even better performance. To enable further research on PEFT methods of LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates various adapters into LLMs and can execute these adapter-based PEFT methods of LLMs for different tasks. The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as well as widely used adapters such as Series adapters, Parallel adapter, Prompt-based learning and Reparametrization-based methods. Moreover, we conduct extensive empirical studies on the impact of adapter types, placement locations, and hyper-parameters to the best design for each adapter-based methods. We evaluate the effectiveness of the adapters on fourteen datasets from two different reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning. The results demonstrate that using adapter-based PEFT in smaller-scale LLMs (7B) with few extra trainable parameters yields comparable, and in some cases superior, performance to powerful LLMs (175B) in zero-shot inference on simple math reasoning datasets.
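The series (bottleneck) adapter that underlies most of these PEFT methods can be sketched briefly. This is a minimal NumPy illustration with made-up dimensions, not the framework's actual code:

```python
import numpy as np

# Sketch of a series bottleneck adapter: a down-projection, nonlinearity,
# up-projection, and residual connection inserted after a frozen sub-layer.
# Only W_down and W_up would be trained; the backbone stays frozen.

rng = np.random.default_rng(0)
d_model, d_bottleneck = 16, 4  # toy sizes; real models use e.g. 4096 -> 64

W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))  # zero-init so the adapter starts as identity

def adapter(h):
    z = np.maximum(h @ W_down, 0.0)  # down-project + ReLU
    return h + z @ W_up              # up-project + residual connection

h = rng.normal(size=(2, d_model))    # a batch of hidden states
out = adapter(h)
print(np.allclose(out, h))           # True: with W_up zero-initialized, output == input
```

The zero-initialized up-projection is a common trick: at the start of training the adapter is a no-op, so inserting it cannot degrade the frozen backbone's behavior.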

MM-BigBench: Evaluating multimodal models on multimodal content comprehension tasks

X Yang, W Wu, S Feng, M Wang, D Wang, Y Li, Q Sun, Y Zhang, X Fu, S Poria

arXiv preprint arXiv:2310.09036 2023
The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through the multimodal interaction to obtain a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of diverse multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at this https URL.

Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions

A Gandhi, K Adhvaryu, S Poria, E Cambria, A Hussain

Information Fusion 2023 920 citations
Sentiment analysis (SA) has gained much traction in the field of artificial intelligence (AI) and natural language processing (NLP). There is growing demand to automate analysis of user sentiment towards products or services. Opinions are increasingly being shared online in the form of videos rather than text alone. This has led to SA using multiple modalities, termed Multimodal Sentiment Analysis (MSA), becoming an important research area. MSA utilises the latest advancements in machine learning and deep learning at various stages, including multimodal feature extraction and fusion and sentiment polarity detection, with the aim of minimizing error rates and improving performance. This survey paper examines the primary taxonomy and newly released multimodal fusion architectures. Recent developments in MSA architectures are divided into ten categories, namely early fusion, late fusion, hybrid fusion, model-level fusion, tensor fusion, hierarchical fusion, bi-modal fusion, attention-based fusion, quantum-based fusion and word-level fusion. A comparison of several architectural evolutions in terms of MSA fusion categories and their relative strengths and limitations is presented. Finally, a number of interdisciplinary applications and future research directions are proposed.

Mustango: Toward controllable text-to-music generation

J Melechovsky, Z Guo, D Ghosal, N Majumder, D Herremans, S Poria

NAACL 2023 157 citations
The quality of the text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music, not only with general text captions, but with more rich captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music audio and using state-of-the-art Music Information Retrieval methods to extract the music features which will then be appended to the existing descriptions in text format. We release the resulting MusicBench dataset which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and the controllability through music-specific text prompts greatly outperforms other models such as MusicGen and AudioLDM2.

Red-teaming large language models using chain of utterances for safety-alignment

R Bhardwaj, S Poria

arXiv preprint arXiv:2308.09662 2023 250 citations
Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabilities simply by optimizing over a next-word prediction objective. With the emergence of their properties and encoded knowledge, the risk of LLMs producing harmful outputs increases, making them unfit for scalable deployment for the public. In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that even widely deployed models are susceptible to the Chain of Utterances-based (CoU) prompting, jailbreaking closed source LLM-based systems such as GPT-4 and ChatGPT to unethically respond to more than 65% and 73% of harmful queries. We also demonstrate the consistency of the RED-EVAL across 8 open-source LLMs in generating harmful responses in more than 86% of the red-teaming attempts. Next, we propose RED-INSTRUCT--An approach for the safety alignment of LLMs. It constitutes two phases: 1) HARMFULQA data collection: Leveraging CoU prompting, we collect a dataset that consists of 1.9K harmful questions covering a wide range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2) SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the safety alignment of LLMs by minimizing the negative log-likelihood over helpful responses and penalizing over harmful responses by gradient ascent over sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely aligned when evaluated on RED-EVAL and HHH benchmarks while preserving the utility of the baseline models (TruthfulQA, MMLU, and BBH).

Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding

YX Tan, N Majumder, S Poria

Interspeech 2023
We introduce SEGUE, a sentence-embedder-guided utterance encoder for spoken language understanding that leverages pretrained sentence embeddings to improve utterance representations. SEGUE adapts sentence-level context into utterance encoding, improving slot filling and intent detection on conversational speech benchmarks while being parameter-efficient.

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

D Ghosal, N Majumder, A Mehrish, S Poria

ACM MM 2023 339 citations
The immense scale of the recent large language models (LLM) allows many interesting properties, such as, instruction- and chain-of-thought-based fine-tuning, that has significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation -- a task where the goal is to generate an audio from its textual description. The prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as, T5. Consequently, our latent diffusion model (LDM)-based approach TANGO outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas the prior methods take a random mix.

UDApter: Efficient Domain Adaptation Using Adapters

B Malik, AR Kashyap, MY Kan, S Poria

EACL 2023
We propose two methods to make unsupervised domain adaptation (UDA) more parameter efficient using adapters – small bottleneck layers interspersed with every layer of the large-scale pre-trained language model (PLM). The first method deconstructs UDA into a two-step process: first by adding a domain adapter to learn domain-invariant information and then by adding a task adapter that uses domain-invariant information to learn task representations in the source domain. The second method jointly learns a supervised classifier while reducing the divergence measure. Compared to strong baselines, our simple methods perform well in natural language inference (MNLI) and the cross-domain sentiment classification task. We even outperform unsupervised domain adaptation methods such as DANN and DSN in sentiment classification, and we are within 0.85% F1 for natural language inference task, by fine-tuning only a fraction of the full model parameters. We release our code at this URL.

Uncertainty Guided Label Denoising for Document-level Distant Relation Extraction

Q Sun, K Huang, X Yang, P Hong, K Zhang, S Poria

ACL 2023
Document-level relation extraction (DocRE) aims to infer complex semantic relations among entities in a document. Distant supervision (DS) is able to generate massive auto-labeled data, which can improve DocRE performance. Recent works leverage pseudo labels generated by the pre-denoising model to reduce noise in DS data. However, unreliable pseudo labels bring new noise, e.g., adding false pseudo labels and losing correct DS labels. Therefore, how to select effective pseudo labels to denoise DS data is still a challenge in document-level distant relation extraction. To tackle this issue, we introduce uncertainty estimation technology to determine whether pseudo labels can be trusted. In this work, we propose a Document-level distant Relation Extraction framework with Uncertainty Guided label denoising, UGDRE. Specifically, we propose a novel instance-level uncertainty estimation method, which measures the reliability of the pseudo labels with overlapping relations. By further considering the long-tail problem, we design dynamic uncertainty thresholds for different types of relations to filter high-uncertainty pseudo labels. We conduct experiments on two public datasets. Our framework outperforms strong baselines by 1.91 F1 and 2.28 Ign F1 on the RE-DocRED dataset.
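The per-relation filtering step can be sketched simply. This is an illustrative toy with invented relation names, uncertainty values, and thresholds, not the UGDRE implementation:

```python
# Sketch of uncertainty-guided pseudo-label filtering with dynamic, per-relation
# thresholds: a pseudo label survives only if its estimated uncertainty falls
# below the threshold assigned to its relation type.

def filter_pseudo_labels(labels, uncertainties, relations, thresholds):
    kept = []
    for label, u, rel in zip(labels, uncertainties, relations):
        if u <= thresholds[rel]:  # relation-specific cutoff
            kept.append(label)
    return kept

# Frequent relations can afford strict thresholds; rare (long-tail) relations
# get looser ones so their scarce pseudo labels are not all discarded.
thresholds = {"works_for": 0.2, "spouse_of": 0.5}
labels = ["l1", "l2", "l3"]
uncertainties = [0.1, 0.4, 0.45]
relations = ["works_for", "works_for", "spouse_of"]
print(filter_pseudo_labels(labels, uncertainties, relations, thresholds))  # ['l1', 'l3']
```

Setting thresholds per relation, rather than one global cutoff, is what lets the long-tail relations keep enough pseudo labels to be learnable.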

WikiDes: A Wikipedia-based dataset for generating short descriptions from paragraphs

HT Ta, ABS Rahman, N Majumder, A Hussain, L Najjar, N Howard, S Poria, A Gelbukh

Information Fusion 2023
As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset for generating short descriptions of Wikipedia articles for the problem of text summarization. The dataset consists of over 80k English samples on 6987 topics. We set up a two-phase summarization method, description generation (Phase I) followed by candidate ranking (Phase II), as a strong approach that relies on transfer and contrastive learning. For description generation, T5 and BART show their superiority compared to other small-scale pre-trained models. By applying contrastive learning with the diverse input from beam search, the metric fusion-based ranking models significantly outperform the direct description generation models, by up to 22 ROUGE points, in both the topic-exclusive and topic-independent splits. Furthermore, human evaluation supports the Phase II descriptions: against the gold descriptions, they are chosen 45.33% of the time, compared to 23.66% for Phase I. In terms of sentiment analysis, the generated descriptions cannot effectively capture all sentiment polarities from the paragraphs, a task the gold descriptions perform better. The automatic generation of new descriptions reduces the human effort in creating them and enriches Wikidata-based knowledge graphs. Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions. Finally, we expect WikiDes to be a useful dataset for related works in capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/WikiDes.

2022

A dataset for hyper-relational extraction and a cube-filling approach

YK Chia, L Bing, SM Aljunied, L Si, S Poria

EMNLP 2022
Relation extraction has the potential for large-scale knowledge graph construction, but current methods do not consider the qualifier attributes of each relation triplet, such as time, quantity or location. The qualifiers form hyper-relational facts which better capture the rich and complex knowledge graph structure. For example, the relation triplet (Leonard Parker, Educated At, Harvard University) can be factually enriched by including the qualifier (End Time, 1967). Hence, we propose the task of hyper-relational extraction to extract more specific and complete facts from text. To support the task, we construct HyperRED, a large-scale and general-purpose dataset. Existing models cannot perform hyper-relational extraction as it requires a model to consider the interaction between three entities. Hence, we propose CubeRE, a cube-filling model that is inspired by table-filling approaches and explicitly considers the interaction between relation triplets and qualifiers. To improve model scalability and reduce negative class imbalance, we further propose a cube-pruning method. Our experiments show that CubeRE outperforms strong baselines and reveal possible directions for future research. Our code and data are available at github.com/declare-lab/HyperRED.
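The cube-filling and cube-pruning ideas can be illustrated with a toy score cube. The numpy sketch below (thresholds, sizes, and variable names are arbitrary illustrations, not the paper's implementation) shows how pruning on a 2D (head, tail) table keeps scoring of the O(n³) qualifier cube sparse:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                                  # tokens in a toy sentence
pair_scores = rng.random((n, n))       # table-filling scores for (head, tail) cells
cube = rng.random((n, n, n))           # scores over (head, tail, qualifier-value) triples

# Cube pruning: only (head, tail) cells that clear a threshold are expanded
# along the qualifier axis, so most of the cube is never scored.
keep_pairs = np.argwhere(pair_scores > 0.7)
candidates = [(h, t, q)
              for h, t in keep_pairs
              for q in range(n)
              if cube[h, t, q] > 0.9]
```

In the real model the scores come from learned span representations, but the pruning logic (filter pairs first, then qualifiers) is the same shape.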

Analyzing Modality Robustness in Multimodal Sentiment Analysis

D Hazarika, Y Li, B Cheng, S Zhao, R Zimmermann, S Poria

NAACL 2022 83 citations
Building robust multimodal models is crucial for achieving reliable deployment in the wild. Despite its importance, little attention has been paid to identifying and improving the robustness of Multimodal Sentiment Analysis (MSA) models. In this work, we hope to address that by (i) proposing simple diagnostic checks for modality robustness in a trained multimodal model. Using these checks, we find MSA models to be highly sensitive to a single modality, which creates issues in their robustness; (ii) analyzing well-known robust training strategies to alleviate these issues. Critically, we observe that robustness can be achieved without compromising the original performance. We hope our extensive study, performed across five models and two benchmark datasets, and the proposed procedures will make robustness an integral component of MSA research. Our diagnostic checks and robust training solutions are simple to implement and available at https://github.com/declare-lab/MSA-Robustness

CICERO: A Dataset for Contextualized Commonsense Inference in Dialogues

D Ghosal, S Shen, N Majumder, R Mihalcea, S Poria

ACL 2022 65 citations
This paper addresses the problem of dialogue reasoning with contextualized commonsense inference. We curate CICERO, a dataset of dyadic conversations with five types of utterance-level reasoning-based inferences: cause, subsequent event, prerequisite, motivation, and emotional reaction. The dataset contains 53,105 such inferences from 5,672 dialogues. We use this dataset to solve relevant generative and discriminative tasks: generation of cause and subsequent event; generation of prerequisite, motivation, and listener’s emotional reaction; and selection of plausible alternatives. Our results ascertain the value of such dialogue-centric commonsense knowledge datasets. It is our hope that CICERO will open new research avenues into commonsense-based dialogue reasoning.

DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification

H Chen, W Han, D Yang, S Poria

COLING 2022
This paper proposes a simple yet effective interpolation-based data augmentation approach termed DoubleMix, to improve the robustness of models in text classification. DoubleMix first leverages a couple of simple augmentation operations to generate several perturbed samples for each training sample, and then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models. Concretely, it first mixes up the perturbed samples into a synthetic sample and then mixes up the original data and the synthetic perturbed data. DoubleMix enhances models’ robustness by learning the “shifted” features in hidden space. On six text classification benchmark datasets, our approach outperforms several popular text augmentation methods including token-level, sentence-level, and hidden-level data augmentation techniques. Also, experiments in low-resource settings show our approach consistently improves models’ performance when training data is scarce. Extensive ablation studies and case studies confirm that each component of our approach contributes to the final performance and show that our approach exhibits superior performance on challenging counterexamples. Additionally, visual analysis shows that the text features generated by our approach are highly interpretable.
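The two-step hidden-space interpolation can be sketched as follows. This is a minimal numpy illustration under assumed mixing distributions (a Dirichlet over the perturbed views, then a Beta-sampled coefficient kept closer to the original), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def doublemix(h_orig, h_perturbed, alpha=1.0, beta=0.5):
    """Two-step interpolation in hidden space.

    h_orig:      (d,) hidden representation of the original sample
    h_perturbed: (k, d) representations of k perturbed versions of it
    """
    # Step 1: mix the perturbed views into one synthetic sample.
    w = rng.dirichlet(np.full(len(h_perturbed), alpha))
    h_synth = w @ h_perturbed
    # Step 2: mix the original with the synthetic perturbed sample,
    # keeping the result closer to the original.
    lam = rng.beta(beta, beta)
    lam = max(lam, 1.0 - lam)
    return lam * h_orig + (1.0 - lam) * h_synth

h_orig = rng.normal(size=16)
h_pert = rng.normal(size=(3, 16))
mixed = doublemix(h_orig, h_pert)
```

Because both steps are convex combinations, the mixed representation always stays inside the coordinate-wise hull of the original and perturbed representations.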

Exemplars-guided empathetic response generation controlled by the elements of human communication

N Majumder, D Ghosal, D Hazarika, A Gelbukh, R Mihalcea, S Poria

IEEE Access 2022
Empathy is fundamental to humans among other animals. It is key to strengthening social cohesion, the cornerstone of health and success of societies. Thus, empathy could be an important component of effective human-computer interactions through conversations. This has motivated a whole sub-field of research focused on empathetic response generation. The majority of existing methods for empathetic response generation rely on the emotion of the context to generate empathetic responses. However, empathy is much more than generating responses with an appropriate emotion. It also often entails subtle expressions of understanding and personal resonance with the situation of the other interlocutor. Unfortunately, such qualities are difficult to quantify, and the datasets lack relevant annotations. To address this issue, in this paper we propose an approach that relies on exemplars to cue the generative model on fine stylistic properties that signal empathy to the interlocutor. To this end, we employ dense passage retrieval to extract relevant exemplary responses from the training set. Three elements of human communication—emotional presence, interpretation, and exploration—and sentiment are additionally introduced using synthetic labels to guide the generation towards empathy. The human evaluation is also extended by these elements of human communication. We empirically show that these approaches yield significant improvements in empathetic response quality in terms of both automated and human-evaluated metrics. The implementation is available at https://github.com/declare-lab/exemplary-empathy.

Improving aspect-level sentiment analysis with aspect extraction

N Majumder, R Bhardwaj, S Poria, A Gelbukh, A Hussain

Neural Computing and Applications 2022 51 citations
Aspect-based sentiment analysis (ABSA), a popular research area in NLP, has two distinct parts: aspect extraction (AE) and labelling the aspects with sentiment polarity (ALSA). Although distinct, these two tasks are highly correlated. This work primarily hypothesizes that transferring knowledge from a pre-trained AE model can benefit the performance of ALSA models. Based on this hypothesis, word embeddings are obtained during AE and subsequently fed to the ALSA model. Empirically, this work shows that the added information significantly improves the performance of three different baseline ALSA models on two distinct domains. This improvement also translates well across domains between the AE and ALSA tasks.

Improving zero-shot learning baselines with commonsense knowledge

A Roy, D Ghosal, E Cambria, N Majumder, R Mihalcea, S Poria

Cognitive Computation 2022
Zero-shot learning, the problem of training and testing on completely disjoint sets of classes, relies greatly on the ability to transfer knowledge from train classes to test classes. Traditionally, semantic embeddings consisting of human-defined attributes or distributed word embeddings are used to facilitate this transfer by improving the association between visual and semantic embeddings. In this paper, we take advantage of explicit relations between nodes defined in ConceptNet, a commonsense knowledge graph, to generate commonsense embeddings of the class labels by using a graph convolution network-based autoencoder. Our experiments, performed on three standard benchmark datasets, surpass the strong baselines when we fuse our commonsense embeddings with existing semantic embeddings, i.e., human-defined attributes and distributed word embeddings. This work paves the path to more brain-inspired approaches to zero-shot learning.

KNOT: Knowledge Distillation using Optimal Transport for Solving NLP Tasks

R Bhardwaj, T Vaidya, S Poria

COLING 2022
We propose a new approach, Knowledge Distillation using Optimal Transport (KNOT), to distill the natural language semantic knowledge from multiple teacher networks to a student network. KNOT aims to train a (global) student model by learning to minimize the optimal transport cost of its assigned probability distribution over the labels to the weighted sum of probabilities predicted by the (local) teacher models, under the constraints that the student model does not have access to teacher models’ parameters or training data. To evaluate the quality of knowledge transfer, we introduce a new metric, Semantic Distance (SD), that measures semantic closeness between the predicted and ground truth label distributions. The proposed method shows improvements in the global model’s SD performance over the baseline across three NLP tasks while performing on par with Entropy-based distillation on standard accuracy and F1 metrics. The implementation pertaining to this work is publicly available at https://github.com/declare-lab/KNOT .
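The transport cost KNOT minimizes can be illustrated with a small Sinkhorn routine. The sketch below (toy ground cost, hypothetical names) computes the entropy-regularized OT cost between a student's label distribution and the weighted mixture of teacher distributions:

```python
import numpy as np

def sinkhorn_cost(p, q, C, reg=0.1, n_iters=200):
    """Entropy-regularized OT cost between label distributions p and q
    under ground-cost matrix C (e.g., semantic distances between labels)."""
    K = np.exp(-C / reg)
    u = np.ones_like(p)
    for _ in range(n_iters):           # Sinkhorn iterations to match marginals
        v = q / (K.T @ u)
        u = p / (K @ v)
    T = u[:, None] * K * v[None, :]    # transport plan with marginals p and q
    return float((T * C).sum())

# Student vs. a weighted mixture of two teachers (all over 3 labels).
student  = np.array([0.7, 0.2, 0.1])
teachers = np.array([[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]])
mixture  = np.array([0.5, 0.5]) @ teachers
C = 1.0 - np.eye(3)                    # toy cost: 0 on the diagonal, 1 otherwise
loss = sinkhorn_cost(student, mixture, C)
```

With a semantically informed `C`, moving probability mass between similar labels is penalized less than between dissimilar ones, which is what distinguishes this loss from plain cross-entropy distillation.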

Knowledge enhanced reflection generation for counseling dialogues

S Shen, V Pérez-Rosas, C Welch, S Poria, R Mihalcea

ACL 2022
In this paper, we study the effect of commonsense and domain knowledge while generating responses in counseling conversations using retrieval and generative methods for knowledge integration. We propose a pipeline that collects domain knowledge through web mining, and show that retrieval from both domain-specific and commonsense knowledge bases improves the quality of generated responses. We also present a model that incorporates knowledge generated by COMET using soft positional encoding and masked self-attention. We show that both retrieved and COMET-generated knowledge improve the system’s performance as measured by automatic metrics and also by human evaluation. Lastly, we present a comparative study on the types of knowledge encoded by our system showing that causal and intentional relationships benefit the generation task more than other types of commonsense relations.

MM-Align: Learning Optimal Transport-based Alignment Dynamics for Fast and Accurate Inference on Missing Modality Sequences

W Han, H Chen, MY Kan, S Poria

EMNLP 2022
Existing multimodal tasks mostly target the complete input modality setting, i.e., each modality is either complete or completely missing in both training and test sets. However, settings where modalities are randomly missing remain underexplored. In this paper, we present a novel approach named MM-Align to address the missing-modality inference problem. Concretely, we propose 1) an alignment dynamics learning module based on the theory of optimal transport (OT) for missing data imputation; 2) a denoising training algorithm to enhance the quality of imputation as well as the accuracy of model predictions. Compared with previous generative methods, which are devoted to restoring the missing inputs, MM-Align learns to capture and imitate the alignment dynamics between modality sequences. Results of comprehensive experiments on two multimodal tasks empirically demonstrate that our method can perform more accurate and faster inference and alleviate the overfitting issue under different missing conditions.

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends

S Uppal, S Bhagat, D Hazarika, N Majumdar, S Poria, R Zimmermann, A Zadeh

Information Fusion 2022 163 citations
Deep Learning and its applications have catalyzed impactful research across modalities. This survey presents a detailed overview of current trends in vision-and-language research, covering task formulations, solution approaches, evaluation strategies, and emerging challenges. We highlight multi-disciplinary patterns and outline directions toward more modular and transparent multimodal systems.

Multiview contextual commonsense inference: A new dataset and task

S Shen, D Ghosal, N Majumder, H Lim, R Mihalcea, S Poria

arXiv preprint arXiv:2210.02890 2022
Contextual commonsense inference is the task of generating various types of explanations around the events in a dyadic dialogue, including cause, motivation, emotional reaction, and others. Producing a coherent and non-trivial explanation requires awareness of the dialogue's structure and of how an event is grounded in the context. In this work, we create CICEROv2, a dataset consisting of 8,351 instances from 2,379 dialogues, containing multiple human-written answers for each contextual commonsense inference question, each representing a type of explanation: cause, subsequent event, motivation, or emotional reaction. We show that the inferences in CICEROv2 are more semantically diverse than those in other contextual commonsense inference datasets. To solve the inference task, we propose a collection of pre-training objectives, including concept denoising and utterance sorting, to prepare a pre-trained model for the downstream contextual commonsense inference task. Our results show that the proposed pre-training objectives are effective at adapting the pre-trained T5-Large model for the contextual commonsense inference task.

PIP: Physical Interaction Prediction via mental simulation with span selection

J Duan, S Yu, S Poria, B Wen, C Tan

ECCV 2022
Accurate prediction of physical interaction outcomes is a crucial component of human intelligence and is important for safe and efficient deployments of robots in the real world. While there are existing vision-based intuitive physics models that learn to predict physical interaction outcomes, they mostly focus on generating short sequences of future frames based on physical properties (e.g. mass, friction and velocity) extracted from visual inputs or a latent space. However, there is a lack of intuitive physics models that are tested on long physical interaction sequences with multiple interactions among different objects. We hypothesize that selective temporal attention during approximate mental simulations helps humans in physical interaction outcome prediction. With these motivations, we propose a novel scheme: Physical Interaction Prediction via Mental Simulation with Span Selection (PIP). It utilizes a deep generative model to model approximate mental simulations by generating future frames of physical interactions before employing selective temporal attention in the form of span selection for predicting physical interaction outcomes. To evaluate our model, we further propose the large-scale SPACE+ dataset of synthetic videos with long sequences of three prime physical interactions in a 3D environment. Our experiments show that PIP outperforms human, baseline, and related intuitive physics models that utilize mental simulation. Furthermore, PIP's span selection module effectively identifies the frames indicating key physical interactions among objects, allowing for added interpretability.

Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)

D Ghosh, BB Klebanov, S Muresan, A Feldman, S Poria, T Chakrabarty

Workshop on Figurative Language Processing 2022

RelationPrompt: Leveraging Prompts to Generate Synthetic Data for Zero-Shot Relation Triplet Extraction

YK Chia, L Bing, S Poria, L Si

ACL 2022 175 citations
Despite the importance of relation extraction in building and representing knowledge, less research is focused on generalizing to unseen relation types. We introduce the task setting of Zero-Shot Relation Triplet Extraction (ZeroRTE) to encourage further research in low-resource relation extraction methods. Given an input sentence, each extracted triplet consists of the head entity, relation label, and tail entity, where the relation label is not seen at the training stage. To solve ZeroRTE, we propose to synthesize relation examples by prompting language models to generate structured texts. Concretely, we unify language model prompts and structured text approaches to design a structured prompt template for generating synthetic relation samples when conditioning on relation label prompts (RelationPrompt). To overcome the limitation of extracting multiple relation triplets from a sentence, we design a novel Triplet Search Decoding method. Experiments on the FewRel and Wiki-ZSL datasets show the efficacy of RelationPrompt for the ZeroRTE task and zero-shot relation classification. Our code and data are available at github.com/declare-lab/RelationPrompt.

SANCL: Multimodal Review Helpfulness Prediction with Selective Attention and Natural Contrastive Learning

W Han, H Chen, Z Hai, S Poria, L Bing

COLING 2022
With the boom of e-commerce, Multimodal Review Helpfulness Prediction (MRHP), which identifies the helpfulness score of multimodal product reviews, has become a research hotspot. Previous work on this task focuses on attention-based modality fusion, information integration, and relation modeling, which exposes the following drawbacks: 1) the models may fail to capture truly essential information due to their indiscriminate attention formulation; 2) they lack modeling methods that take full advantage of the correlations among the provided data. In this paper, we propose SANCL: Selective Attention and Natural Contrastive Learning for MRHP. SANCL adopts a probe-based strategy to enforce high attention weights on the regions of greater significance. It also constructs a contrastive learning framework based on natural matching properties in the dataset. Experimental results on two benchmark datasets with three categories show that SANCL achieves state-of-the-art performance with lower memory consumption.

SAT: Improving Semi-Supervised Text Classification with Simple Instance-Adaptive Self-Training

H Chen, W Han, S Poria

COLING 2022
Self-training methods have been explored in recent years and have exhibited great performance in improving semi-supervised learning. This work presents a simple instance-adaptive self-training method (SAT) for semi-supervised text classification. SAT first generates two augmented views for each unlabeled sample, and then trains a meta learner to automatically identify the relative strength of the augmentations based on the similarity between the original view and the augmented views. The weakly-augmented view is fed to the model to produce a pseudo-label and the strongly-augmented view is used to train the model to predict the same pseudo-label. We conducted extensive experiments and analyses on three text classification datasets and found that with varying sizes of labeled training data, SAT consistently shows competitive performance compared to existing semi-supervised learning methods.
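The instance-adaptive weak/strong assignment can be approximated with a similarity heuristic. The numpy sketch below stands in for the paper's learned meta learner (names and the cosine-similarity rule are illustrative assumptions): the view closer to the original is treated as "weak" and produces the pseudo-label, while the other is "strong":

```python
import numpy as np

def assign_views(h_orig, h_a, h_b):
    """Instance-adaptive view assignment: the augmented view more similar to
    the original produces the pseudo-label ('weak'); the model is trained to
    predict that same pseudo-label from the other view ('strong')."""
    cos = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    if cos(h_orig, h_a) >= cos(h_orig, h_b):
        return h_a, h_b   # (weak, strong)
    return h_b, h_a

orig = np.array([1.0, 0.0])
near = np.array([0.9, 0.1])    # mild augmentation, stays close to the original
far  = np.array([0.1, 0.9])    # aggressive augmentation
weak, strong = assign_views(orig, near, far)
```

In SAT proper, this decision is made per instance by a trained meta learner rather than a fixed rule, which is what makes the method instance-adaptive.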

So Different Yet So Alike! Constrained Unsupervised Text Style Transfer

AR Kashyap, D Hazarika, MY Kan, R Zimmermann, S Poria

ACL 2022
Automatic transfer of text between domains has become popular in recent times. One of its aims is to preserve the semantic content while adapting to the target domain. However, it does not explicitly maintain other attributes between the source and translated text: e.g., text length and descriptiveness. Maintaining constraints in transfer has several downstream applications, including data augmentation and debiasing. We introduce a method for such constrained unsupervised text style transfer by adding two complementary losses to the generative adversarial network (GAN) family of models. Unlike the competing losses used in GANs, we introduce cooperative losses where the discriminator and the generator cooperate and reduce the same loss. The first is a contrastive loss and the second is a classification loss, aiming to regularize the latent space further and bring similar sentences closer together. We demonstrate that such training retains lexical, syntactic and domain-specific constraints between domains for multiple benchmark datasets, including ones where more than one attribute changes. We show that the complementary cooperative losses improve text quality, according to both automated and human evaluation measures.

Towards solving NLP tasks with optimal transport loss

R Bhardwaj, T Vaidya, S Poria

Journal of King Saud University-Computer and Information Sciences 2022
We explore the use of optimal transport-based loss functions for solving NLP tasks, framing label alignment and distributional matching as transport problems. The proposed loss improves model calibration and helps in tasks where alignment between predicted and target distributions matters, demonstrating gains across several sequence labeling and classification benchmarks.

Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering

D Ghosal, N Majumder, R Mihalcea, S Poria

EMNLP 2022
We propose a simple refactoring of multi-choice question answering (MCQA) tasks as a series of binary classifications. The MCQA task is generally performed by scoring each (question, answer) pair normalized over all the pairs, and then selecting the answer from the pair that yields the highest score. For n answer choices, this is equivalent to an n-class classification setup where only one class (the true answer) is correct. We instead show that classifying (question, true answer) as positive instances and (question, false answer) as negative instances is significantly more effective across various models and datasets. We show the efficacy of our proposed approach in different tasks: abductive reasoning, commonsense question answering, science question answering, and sentence completion. Our DeBERTa binary classification model reaches the top, or close to the top, performance on public leaderboards for these tasks. The source code of the proposed approach is available at https://github.com/declare-lab/TEAM.
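The refactoring itself is mechanical. A minimal sketch (function name hypothetical) that turns one n-way MCQA instance into n binary instances:

```python
def mcqa_to_binary(question, choices, answer_idx):
    """Refactor one n-way MCQA instance into n binary classification instances:
    (question, true answer) -> 1, (question, false answer) -> 0."""
    return [((question, c), int(i == answer_idx)) for i, c in enumerate(choices)]

# At inference time, each (question, choice) pair is scored independently by
# the binary classifier, and the choice with the highest positive probability wins.
pairs = mcqa_to_binary("Where does rain come from?",
                       ["clouds", "rivers", "trees", "rocks"], 0)
```

Unlike the n-class formulation, every false choice now contributes its own negative training signal rather than only normalizing the softmax.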

Vector-Quantized Input-Contextualized Soft Prompts for Natural Language Understanding

R Bhardwaj, A Saha, SCH Hoi, S Poria

EMNLP 2022
Prompt Tuning has been largely successful as a parameter-efficient method of conditioning large-scale pre-trained language models to perform downstream tasks. Thus far, soft prompt tuning learns a fixed set of task-specific continuous vectors, i.e., soft tokens that remain static across task samples. A fixed prompt, however, may not generalize well to the diverse kinds of inputs the task comprises. In order to address this, we propose Vector-quantized Input-contextualized Prompts (VIP) as an extension to the soft prompt tuning framework. VIP focuses on two aspects: contextual prompts, which learn input-specific contextualizations of the soft prompt tokens through a small-scale sentence encoder, and quantized prompts, which map the contextualized prompts to a set of learnable codebook vectors through a vector quantization network. On various language understanding tasks like SuperGLUE, QA, relation classification, NER and NLI, VIP outperforms the soft prompt tuning (PT) baseline by an average margin of 1.19%. Further, our generalization studies show that VIP learns more robust prompt representations, surpassing PT by margins of 0.6% and 5.3% on out-of-domain QA and NLI tasks respectively, and by 0.75% on a multi-task setup over 4 tasks spanning 12 domains.
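The quantization step can be sketched as a nearest-codebook lookup. The following numpy illustration (names and sizes are assumptions, and the straight-through gradient trick used during training is omitted) maps each contextualized prompt token to its closest codebook vector:

```python
import numpy as np

def quantize_prompts(contextual_prompts, codebook):
    """Map each contextualized prompt token to its nearest codebook vector.

    contextual_prompts: (n_tokens, d) input-specific soft prompt tokens
    codebook:           (n_codes, d) learnable quantization vectors
    """
    # Squared Euclidean distance from every prompt token to every code.
    d2 = ((contextual_prompts[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)            # index of the nearest code per token
    return codebook[idx], idx

rng = np.random.default_rng(0)
prompts  = rng.normal(size=(8, 16))
codebook = rng.normal(size=(32, 16))
quantized, codes = quantize_prompts(prompts, codebook)
```

Quantizing through a shared codebook limits how far the contextualized prompts can drift per input, which is one plausible source of the robustness gains reported above.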

2021

Affect recognition for multimodal natural language processing

S Poria, OY Soon, B Liu, L Bing

Cognitive Computation 2021
Language is inherently multimodal. It has many forms of appearance, like speech, gestures, facial expressions, and head-nods. In an ideal human-machine conversational system, machines should understand this multimodal language. Understanding human language also largely depends on the machines’ ability to interpret emotions. Emotional sensitivity can prevent desultory answers provided by these machines, thus making conversations more natural and engaging. For us humans, emotions aid our learning, communication, and decision-making. Hence, over the past two decades, there has been a significant effort to incorporate cognitive capabilities into machines so that they can interpret, comprehend and express emotions. Computational analysis of human multimodal language is an emerging research area in natural language processing (NLP). It expands the horizons of NLP to study language used in face-to-face communication and in online multimedia. This form of language contains the modalities of language (in terms of spoken text), visual (in terms of gestures and facial expressions) and acoustic (in terms of changes in the voice tone). At its core, this research area is focused on modeling the three modalities and their complex interactions. This special issue on Affect Recognition in Human Multimodal Language aims to facilitate the growth of this new research direction in the community. The challenges of modeling human multimodal language can be split into two major categories: (1) studying each modality individually and modeling each in a manner that can be linked to other modalities (also known as intramodal dynamics), and (2) linking the modalities by modeling the interactions between them (also known as intermodal dynamics). Common forms of these interactions include complementary or correlated information across modes.
Intrinsic to each modality, modeling human multimodal language is complex due to factors such as idiosyncrasy in communicative styles, non-trivial alignment between modalities, and unreliable or contradictory information across modalities. Therefore, computational analysis of multimodal language is a challenging research area. This special issue aimed at bringing together contributions from both academics and practitioners in the context of affect recognition in multimodal language, with a primary focus on: 1) multimodal affect recognition in monologues, 2) effective multimodal fusion, and 3) detecting affect in dyadic and multiparty multimodal conversations. Detecting affect in conversations is more challenging than in monologues, primarily due to the complex inter-dependency between speaker states in the conversation.

Aspect sentiment triplet extraction using reinforcement learning

S Yu Bai Jian, T Nayak, N Majumder, S Poria

CIKM 2021 53 citations
Aspect Sentiment Triplet Extraction (ASTE) is the task of extracting triplets of aspect terms, their associated sentiments, and the opinion terms that provide evidence for the expressed sentiments. Previous approaches to ASTE usually simultaneously extract all three components or first identify the aspect and opinion terms, then pair them up to predict their sentiment polarities. In this work, we present a novel paradigm, ASTE-RL, by regarding the aspect and opinion terms as arguments of the expressed sentiment in a hierarchical reinforcement learning (RL) framework. We first focus on sentiments expressed in a sentence, then identify the target aspect and opinion terms for that sentiment. This takes into account the mutual interactions among the triplet's components while improving exploration and sample efficiency. Furthermore, this hierarchical RL setup enables us to deal with multiple and overlapping triplets. In our experiments, we evaluate our model on existing datasets from laptop and restaurant domains and show that it achieves state-of-the-art performance. The implementation of this work is publicly available at https://github.com/declare-lab/ASTE-RL.

Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis

W Han, H Chen, A Gelbukh, A Zadeh, L Morency, S Poria

ICMI 2021 332 citations
Multimodal sentiment analysis aims to extract and integrate semantic information collected from multiple modalities to recognize the expressed emotions and sentiment in multimodal data. This research area’s major concern lies in developing an effective fusion scheme that can extract and integrate key information from various modalities. However, previous work is restricted by not leveraging the dynamics of independence and correlation between modalities to reach top performance. To mitigate this, we propose the Bi-Bimodal Fusion Network (BBFN), a novel end-to-end network that performs fusion (relevance increment) and separation (difference increment) on pairwise modality representations. The two parts are trained simultaneously such that the combat between them is simulated. The model takes two bimodal pairs as input due to the known information imbalance among modalities. In addition, we leverage a gated control mechanism in the Transformer architecture to further improve the final output. Experimental results on three datasets (CMU-MOSI, CMU-MOSEI, and UR-FUNNY) verify that our model significantly outperforms the SOTA. The implementation of this work is available at https://github.com/declare-lab/multimodal-deep-learning and https://github.com/declare-lab/BBFN.

Causal augmentation for causal sentence classification

FA Tan, D Hazarika, SK Ng, S Poria, R Zimmermann

Workshop on Causal Inference and NLP 2021
Scarcity of annotated causal texts leads to poor robustness when training state-of-the-art language models for causal sentence classification. In particular, we found that models misclassify augmented sentences that have been negated or strengthened with respect to their causal meaning. This is worrying since minor linguistic differences in causal sentences can have disparate meanings. Therefore, we propose the generation of counterfactual causal sentences by creating contrast sets (Gardner et al., 2020) to be included during model training. We experimented on two model architectures and predicted on two out-of-domain corpora. While our strengthening schemes proved useful in improving model performance, for negation, regular edits were insufficient. Thus, we also introduce heuristics like shortening or multiplying the root words of a sentence. By including a mixture of edits when training, we achieved performance improvements beyond the baseline across both models, both within and outside the corpora’s domains, suggesting that our proposed augmentation can also help models generalize.

CIDER: Commonsense inference for dialogue explanation and reasoning

D Ghosal, P Hong, S Shen, N Majumder, R Mihalcea, S Poria

SIGDIAL 2021
Commonsense inference to understand and explain human language is a fundamental research problem in natural language processing. Explaining human conversations poses a great challenge as it requires contextual understanding, planning, inference, and several aspects of reasoning including causal, temporal, and commonsense reasoning. In this work, we introduce CIDER – a manually curated dataset that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference. Extracting such rich explanations from conversations can be conducive to improving several downstream applications. The annotated triplets are categorized by the type of commonsense knowledge present (e.g., causal, conditional, temporal). We set up three different tasks conditioned on the annotated dataset: Dialogue-level Natural Language Inference, Span Extraction, and Multi-choice Span Selection. Baseline results obtained with transformer-based models reveal that the tasks are difficult, paving the way for promising future research. The dataset and the baseline implementations are publicly available at https://github.com/declare-lab/CIDER .

Deep neural approaches to relation triplets extraction: a comprehensive survey

T Nayak, N Majumder, P Goyal, S Poria

Cognitive Computation 2021 78 citations
The task of relation extraction is about identifying entities and relations among them in free text for the enrichment of structured knowledge bases (KBs). In this paper, we present a comprehensive survey of this important research topic in natural language processing. Recently, with the advances made in the continuous representation of words (word embeddings) and deep neural architectures, many research works are published in the area of relation extraction. To help future research, we present a comprehensive review of the recently published research works in relation extraction. Previous surveys on this task covered only one aspect of relation extraction that is pipeline-based relation extraction approaches at the sentence level. In this survey, we cover sentence-level relation extraction to document-level relation extraction, pipeline-based approaches to joint extraction approaches, annotated datasets to distantly supervised datasets along with few very recent research directions such as zero-shot or few-shot relation extraction, noise mitigation in distantly supervised datasets. Regarding neural architectures, we cover convolutional models, recurrent network models, attention network models, and graph convolutional models in this survey. We survey more than 100 publications in the field of relation extraction and present them in a structured way based on their similarity in the specific task they tried to solve, their model architecture, the datasets they used for experiments. We include the current state-of-the-art performance in several datasets in this paper for comparison. In this paper, we have covered different aspects of research in relation extraction field with a key focus on recent deep neural network-based methods. Also, we identify possible future research directions. Hopefully, this will help future researchers to identify the current research gaps and take the field forward.

Discriminative dictionary design for action classification in still images and videos

A Roy, B Banerjee, A Hussain, S Poria

Cognitive Computation 2021
In this paper, we address the problem of action recognition from still images and videos. Traditional local features such as SIFT and STIP invariably pose two potential problems: 1) they are not evenly distributed in different entities of a given category and 2) many such features are not exclusive of the visual concept the entities represent. In order to generate a dictionary taking the aforementioned issues into account, we propose a novel discriminative method for identifying robust and category-specific local features which maximize the class separability to a greater extent. Specifically, we pose the selection of potent local descriptors as a filtering-based feature selection problem, which ranks the local features per category based on a novel measure of distinctiveness. The underlying visual entities are subsequently represented based on the learned dictionary, and this stage is followed by action classification using the random forest model followed by label propagation refinement. The framework is validated on action recognition datasets based on still images (Stanford-40) as well as videos (UCF-50). We get 51.2% and 66.7% recognition accuracy for Stanford-40 and UCF-50, respectively. Compared to other representative methods from the literature, our approach exhibits superior performance. This proves the effectiveness of the adaptive ranking methodology presented in this work.
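The filtering-based ranking can be illustrated with a toy purity score; this is an assumption for illustration, as the paper defines its own distinctiveness measure:

```python
import numpy as np

def rank_by_distinctiveness(counts):
    """Score each local feature by how concentrated its occurrences
    are in a single category (simple purity ratio, for illustration).
    counts[i, c] = occurrences of feature i in category c."""
    score = counts.max(axis=1) / counts.sum(axis=1)
    ranking = sorted(range(len(counts)), key=lambda i: -score[i])
    return ranking, score

# Rows: local features; columns: occurrence counts per action category.
counts = np.array([[9, 1], [5, 5], [2, 8]])
ranking, score = rank_by_distinctiveness(counts)
print(ranking)  # [0, 2, 1]: most category-specific feature first
```

Feature 0 appears almost exclusively in one category (purity 0.9), so it ranks highest; feature 1 is split evenly and ranks last.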

DOZEN: Cross-domain zero shot named entity recognition with knowledge graph

HV Nguyen, F Gelli, S Poria

SIGIR 2021
With the new developments of natural language processing, increasing attention has been given to the task of Named Entity Recognition (NER). However, the vast majority of work focuses on a small number of large-scale annotated datasets with a limited number of entities such as person, location and organization. While other datasets have been introduced with domain-specific entities, the smaller size of these largely limits the applicability of state-of-the-art deep models. Even if there are promising new approaches for performing zero-shot learning (ZSL), they are not designed for a cross-domain setting. We propose Cross Domain Zero Shot Named Entity Recognition with Knowledge Graph (DOZEN), which learns the relations between entities across different domains from an existing ontology of external knowledge and a set of analogies linking entities and domains. Experiments performed on both large-scale and domain-specific datasets indicate that DOZEN is the most suitable option to extract unseen entities in a target dataset from a different domain.

Exploring the role of context in utterance-level emotion, act and intent classification in conversations: An empirical study

D Ghosal, N Majumder, R Mihalcea, S Poria

ACL Findings 2021
In recent years, a number of context-aware approaches have been proposed for various utterance-level dialogue understanding tasks. In this paper, we explore and quantify the role of context for different aspects of a dialogue, namely emotion, dialogue act, and intent identification, using state-of-the-art dialogue understanding methods as baselines. Specifically, we employ various perturbations to distort the context of a given utterance and study its impact on the different tasks and baselines. This provides us with insights into the fundamental contextual factors that have immediate implications on different aspects of a dialogue. Such insights may inspire more effective dialogue understanding models and provide support for future text generation approaches.

Improving distantly supervised relation extraction with self-ensemble noise filtering

T Nayak, N Majumder, S Poria

RANLP 2021
Distantly supervised models are very popular for relation extraction since we can obtain a large amount of training data using the distant supervision method without human annotation. In distant supervision, a sentence is considered as a source of a tuple if the sentence contains both entities of the tuple. However, this condition is too permissive and does not guarantee the presence of relevant relation-specific information in the sentence. As such, distantly supervised training data contains much noise which adversely affects the performance of the models. In this paper, we propose a self-ensemble filtering mechanism to filter out the noisy samples during the training process. We evaluate our proposed framework on the New York Times dataset which is obtained via distant supervision. Our experiments with multiple state-of-the-art neural relation extraction models show that our proposed filtering mechanism improves the robustness of the models and increases their F1 scores.
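One plausible instantiation of self-ensemble filtering is sketched below; the EMA scheme and threshold are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def self_ensemble_filter(prob_history, distant_labels, alpha=0.6, thresh=0.5):
    """Ensemble each sample's predicted class probabilities across
    training epochs with an exponential moving average, then drop
    samples whose ensembled probability for the distant label is low
    (illustrative scheme; hyperparameters are assumptions)."""
    ema = prob_history[0]
    for p in prob_history[1:]:
        ema = alpha * ema + (1 - alpha) * p
    keep = ema[np.arange(len(distant_labels)), distant_labels] >= thresh
    return keep

# Three epochs of predictions for 3 samples over 2 relation classes.
hist = [np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])] * 3
labels = np.array([0, 0, 0])  # distant labels, possibly noisy
print(self_ensemble_filter(hist, labels))  # [ True False  True]
```

Sample 1 is flagged as noise: the model's own ensembled belief contradicts its distant label.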

Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis

W Han, H Chen, S Poria

EMNLP 2021 702 citations
In multimodal sentiment analysis (MSA), the performance of a model highly depends on the quality of synthesized embeddings. These embeddings are generated from the upstream process called multimodal fusion, which aims to extract and combine the input unimodal raw data to produce a richer multimodal representation. Previous work either back-propagates the task loss or manipulates the geometric property of feature spaces to produce favorable fusion results, which neglects the preservation of critical task-related information that flows from input to the fusion results. In this work, we propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (inter-modality) and between multimodal fusion result and unimodal input in order to maintain task-related information through multimodal fusion. The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task. To address the intractable issue of MI bounds, we further formulate a set of computationally simple parametric and non-parametric methods to approximate their truth value. Experimental results on the two widely used datasets demonstrate the efficacy of our approach.
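A generic InfoNCE-style lower bound illustrates the kind of MI estimation involved; MMIM's parametric and non-parametric bounds are its own formulations, so this is only a sketch:

```python
import numpy as np

def infonce_lower_bound(z1, z2, temperature=1.0):
    """InfoNCE lower bound on the mutual information between paired
    views z1[i] <-> z2[i] (a generic contrastive estimator)."""
    scores = (z1 @ z2.T) / temperature            # pairwise similarities
    logits = scores - scores.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries score the true (positive) pairs.
    return log_softmax.diagonal().mean() + np.log(len(z1))

rng = np.random.default_rng(1)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.01 * rng.normal(size=(8, 16))  # nearly identical paired view
b = infonce_lower_bound(z1, z2)
print(b)  # close to log(8) for strongly dependent views
```

The bound is capped at log N for a batch of N pairs, which is why MI-maximization objectives of this family benefit from larger batches.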

Information fusion for affective computing and sentiment analysis

A Hussain, E Cambria, S Poria, A Hawalah, F Herrera

Information Fusion 2021
Emotions are intrinsically part of our mental activity and play a key role in communication and decision-making processes [1,2]. Emotion, cognition, and action interact in feedback loops and emotion can be viewed in a structural model tied to adaptation. Besides being important for the advancement of AI, detecting and interpreting emotional information is key in multiple areas of computer science, e.g., human-agent, -computer, and -robot interaction, but also e-learning, e-health, domotics, automotive, security, user profiling and personalization. In recent years, emotion and sentiment analysis has become increasingly popular also for processing social media data on social networks, online communities, blogs, Wikis, microblogging platforms, and other online collaborative media [3,4]. The distillation of knowledge from such a big amount of unstructured information, however, is an extremely difficult task, as the contents of today’s Web are perfectly suitable for human consumption, but remain hardly accessible to machines. The opportunity to capture the opinions of the general public about social events, political movements, company strategies, marketing campaigns, and product preferences has raised growing interest both within the scientific community, leading to many exciting open challenges, as well as in the business world, due to the remarkable benefits to be had from marketing [5] and financial market prediction [6]. Most of existing approaches to affective computing and sentiment analysis are still based on the syntactic representation of text, a method that relies mainly on word co-occurrence frequencies. Such algorithms are limited by the fact that they can only process information they can ‘see’. 
As human text processors, we do not have such limitations as every word we see activates a cascade of semantically related concepts, relevant episodes, emotions, and sensory experiences [7], all of which enable the completion of other complex NLP tasks such as subjectivity detection [8], anaphora resolution [9], personality recognition [10], and more. Information fusion can aid to mimic the way humans process and analyze text and, hence, overcome the limitations of standard approaches to affective computing and sentiment analysis.

Investigating gender bias in BERT

R Bhardwaj, N Majumder, S Poria

Cognitive Computation 2021 164 citations
In this work, we analyze the gender bias induced by BERT in downstream tasks. We also propose solutions to reduce gender bias. Contextual language models (CLMs) have pushed the NLP benchmarks to a new height. It has become a new norm to utilize CLM-provided word embeddings in downstream tasks such as text classification. However, unless addressed, CLMs are prone to learn intrinsic gender bias in the dataset. As a result, predictions of downstream NLP models can vary noticeably by varying gender words, such as replacing “he” with “she”, or even gender-neutral words. In this paper, we focus our analysis on a popular CLM, i.e., BERT. We analyze the gender bias it induces in five downstream tasks related to emotion and sentiment intensity prediction. For each task, we train a simple regressor utilizing BERT’s word embeddings. We then evaluate the gender bias in the regressors using an equity evaluation corpus. Ideally, by design, the models should discard gender-informative features from the input. However, the results show a significant dependence of the system’s predictions on gender-particular words and phrases. We claim that such biases can be reduced by removing gender-specific features from the word embeddings. Hence, for each layer in BERT, we identify directions that primarily encode gender information. The space formed by such directions is referred to as the gender subspace in the semantic space of word embeddings.
We propose an algorithm that finds fine-grained gender directions, i.e., one primary direction for each BERT layer. This obviates the need to realize the gender subspace in multiple dimensions and prevents other crucial information from being omitted. Experiments show that removing embedding components along the gender directions achieves great success in reducing BERT-induced bias in the downstream tasks. The investigation reveals significant gender bias that a contextualized language model (i.e., BERT) induces in downstream tasks. The proposed solution seems promising in reducing such biases.
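The projection-based debiasing step can be sketched as follows; a single averaged he/she difference direction stands in here for the paper's per-layer gender directions:

```python
import numpy as np

def remove_gender_direction(embeddings, pairs):
    """Estimate a gender direction from averaged difference vectors of
    gendered word pairs, then project it out of every embedding
    (illustrative single-direction variant)."""
    diffs = np.stack([embeddings[a] - embeddings[b] for a, b in pairs])
    g = diffs.mean(axis=0)
    g /= np.linalg.norm(g)
    return {w: v - (v @ g) * g for w, v in embeddings.items()}

emb = {
    "he": np.array([1.0, 0.0]),
    "she": np.array([-1.0, 0.0]),
    "doctor": np.array([0.6, 0.8]),
}
debiased = remove_gender_direction(emb, [("he", "she")])
print(debiased["doctor"])  # component along the gender axis removed
```

After projection, "doctor" retains its non-gender component while "he" and "she" collapse onto the same point along that axis.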

M2H2: A multimodal multiparty hindi dataset for humor recognition in conversations

DS Chauhan, GV Singh, N Majumder, A Zadeh, A Ekbal, P Bhattacharyya, S Poria

ICMI 2021
Humor recognition in conversations is a challenging task that has recently gained popularity due to its importance in dialogue understanding, including in multimodal settings (i.e., text, acoustics, and visual). The few existing datasets for humor are mostly in English. However, due to the tremendous growth in multilingual content, there is a great demand to build models and systems that support multilingual information access. To this end, we propose a dataset for Multimodal Multiparty Hindi Humor (M2H2) recognition in conversations containing 6,191 utterances from 13 episodes of a very popular TV series “Shrimaan Shrimati Phir Se”. Each utterance is annotated with humor/non-humor labels and encompasses acoustic, visual, and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for humor recognition in conversations. The empirical results on the M2H2 dataset demonstrate that multimodal information complements unimodal information for humor recognition. The dataset and the baselines are available at http://www.iitp.ac.in/~ai-nlp-ml/resources.html and https://github.com/declare-lab/M2H2-dataset.

More identifiable yet equally performant transformers for text classification

R Bhardwaj, N Majumder, S Poria, E Hovy

ACL 2021
Interpretability is an important aspect of the trustworthiness of a model’s predictions. Transformer’s predictions are widely explained by the attention weights, i.e., a probability distribution generated at its self-attention unit (head). Current empirical studies provide evidence that attention weights are not explanations by proving that they are not unique. A recent study showed theoretical justifications for this observation by proving the non-identifiability of attention weights. For a given input to a head and its output, if the attention weights generated in it are unique, we call the weights identifiable. In this work, we provide deeper theoretical analysis and empirical observations on the identifiability of attention weights. Ignored in the previous works, we find the attention weights are more identifiable than we currently perceive by uncovering the hidden role of the key vector. However, the weights are still prone to being non-unique, which makes them unfit for interpretation. To tackle this issue, we provide a variant of the encoder layer that decouples the relationship between key and value vector and provides identifiable weights up to the desired length of the input. We prove the applicability of such variations by providing empirical justifications on varied text classification tasks. The implementations are available at https://github.com/declare-lab/identifiable-transformers .

Persuasive dialogue understanding: The baselines and negative results

H Chen, D Ghosal, N Majumder, A Hussain, S Poria

Neurocomputing 2021
Persuasion aims at forming one's opinion and action via a series of persuasive messages containing persuader's strategies. Due to its potential application in persuasive dialogue systems, the task of persuasive strategy recognition has gained much attention lately. Previous methods on user intent recognition in dialogue systems adopt recurrent neural network (RNN) or convolutional neural network (CNN) to model context in conversational history, neglecting the tactic history and intra-speaker relation. In this paper, we demonstrate the limitations of a Transformer-based approach coupled with Conditional Random Field (CRF) for the task of persuasive strategy recognition. In this model, we leverage inter- and intra-speaker contextual semantic features, as well as label dependencies to improve the recognition. Despite extensive hyper-parameter optimizations, this architecture fails to outperform the baseline methods. We observe two negative results. Firstly, CRF cannot capture persuasive label dependencies, possibly as strategies in persuasive dialogues do not follow any strict grammar or rules as the cases in Named Entity Recognition (NER) or part-of-speech (POS) tagging. Secondly, the Transformer encoder trained from scratch is less capable of capturing sequential information in persuasive dialogues than Long Short-Term Memory (LSTM). We attribute this to the reason that the vanilla Transformer encoder does not efficiently consider relative position information of sequence elements.

Phonetic-enriched text representation for Chinese sentiment analysis with reinforcement learning

H Peng, Y Ma, S Poria, Y Li, E Cambria

Information Fusion 2021 70 citations
The Chinese pronunciation system offers two characteristics that distinguish it from other languages: deep phonemic orthography and intonation variations. We are the first to argue that these two important properties can play a major role in Chinese sentiment analysis. In particular, we propose two effective features to encode phonetic information. Next, we develop a Disambiguate Intonation for Sentiment Analysis (DISA) network using reinforcement learning, which disambiguates the intonation of each Chinese character (pinyin) so that a precise phonetic representation of Chinese is learned. Furthermore, we fuse phonetic features with textual and visual features in order to mimic the way humans read and understand Chinese text. Experimental results on five different Chinese sentiment analysis datasets show that the inclusion of phonetic features significantly and consistently improves the performance of textual and visual representations and outshines the state-of-the-art Chinese character-level representations.

Recognizing emotion cause in conversations

S Poria, N Majumder, D Hazarika, D Ghosal, R Bhardwaj, SYB Jian, P Hong, R Ghosh, A Roy, N Chhaya, A Gelbukh, R Mihalcea

Cognitive Computation 2021 195 citations
We address the problem of recognizing emotion cause in conversations, define two novel sub-tasks of this problem, and provide a corresponding dialogue-level dataset, along with strong transformer-based baselines. The dataset is available at https://github.com/declare-lab/RECCON. Recognizing the cause behind emotions in text is a fundamental yet under-explored area of research in NLP. Advances in this area hold the potential to improve interpretability and performance in affect-based models. Identifying emotion causes at the utterance level in conversations is particularly challenging due to the intermingling dynamics among the interlocutors. We introduce the task of Recognizing Emotion Cause in CONversations with an accompanying dataset named RECCON, containing over 1,000 dialogues and 10,000 utterance cause/effect pairs. Furthermore, we define different cause types based on the source of the causes, and establish strong Transformer-based baselines to address two different sub-tasks on this dataset. Our transformer-based baselines, which leverage contextual pre-trained embeddings, such as RoBERTa, outperform the state-of-the-art emotion cause extraction approaches on our dataset. We introduce a new task highly relevant for (explainable) emotion-aware artificial intelligence: recognizing emotion cause in conversations, provide a new highly challenging publicly available dialogue-level dataset for this task, and give strong baseline results on this dataset.

Retrieving and reading: A comprehensive survey on open-domain question answering

F Zhu, W Lei, C Wang, J Zheng, S Poria, TS Chua

arXiv preprint arXiv:2101.00774 2021 453 citations
Open-domain Question Answering (OpenQA) is an important task in Natural Language Processing (NLP), which aims to answer a question in the form of natural language based on large-scale unstructured documents. Recently, there has been a surge in the amount of research literature on OpenQA, particularly on techniques that integrate with neural Machine Reading Comprehension (MRC). While these research works have advanced performance to new heights on benchmark datasets, they have been rarely covered in existing surveys on QA systems. In this work, we review the latest research trends in OpenQA, with particular attention to systems that incorporate neural MRC techniques. Specifically, we begin with revisiting the origin and development of OpenQA systems. We then introduce the modern OpenQA architecture named "Retriever-Reader" and analyze the various systems that follow this architecture as well as the specific techniques adopted in each of the components. We then discuss key challenges to developing OpenQA systems and offer an analysis of benchmarks that are commonly used. We hope our work would enable researchers to be informed of the recent advancement and also the open challenges in OpenQA research, so as to stimulate further progress in this field.
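The surveyed "Retriever-Reader" architecture can be sketched with a toy retriever (cosine ranking over small vectors); real systems use sparse or dense retrievers, and the reader stage, an MRC model that extracts the answer span from the top-k documents, is omitted here:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=2):
    """Retriever stage: rank documents by cosine similarity to the
    query and return the indices of the top-k (toy stand-in)."""
    sims = (doc_vecs @ query_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return sorted(range(len(doc_vecs)), key=lambda i: -sims[i])[:k]

# Tiny 2-d document vectors; a reader would then process these top-k docs.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
top = retrieve(np.array([1.0, 0.2]), docs)
print(top)  # [1, 0]
```

The retriever narrows millions of documents to a handful; the reader only ever sees that shortlist, which is what makes the two-stage design tractable.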

STaCK: Sentence ordering with temporal commonsense knowledge

D Ghosal, N Majumder, R Mihalcea, S Poria

EMNLP 2021
Sentence order prediction is the task of finding the correct order of sentences in a randomly ordered document. Correctly ordering the sentences requires an understanding of coherence with respect to the chronological sequence of events described in the text. Document-level contextual understanding and commonsense knowledge centered around these events are often essential in uncovering this coherence and predicting the exact chronological order. In this paper, we introduce STaCK — a framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences. Our graph network accumulates temporal evidence using knowledge of ‘past’ and ‘future’ and formulates sentence ordering as a constrained edge classification problem. We report results on five different datasets, and empirically show that the proposed method is naturally suitable for order prediction. The implementation of this work is available at: https://github.com/declare-lab/sentence-ordering .
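Recovering an order from pairwise precedence beliefs can be illustrated with a simple win-counting decoder; this is an illustrative simplification, not STaCK's constrained edge-classification formulation:

```python
import numpy as np

def order_from_pairwise(prec_prob):
    """Decode a global sentence order from pairwise beliefs.
    prec_prob[i, j] = P(sentence i precedes sentence j); sentences
    are sorted by how many others they are predicted to precede."""
    wins = (prec_prob > 0.5).sum(axis=1)
    return sorted(range(len(prec_prob)), key=lambda i: -wins[i])

# Pairwise beliefs for three shuffled sentences; true order is 2, 0, 1.
P = np.array([
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.3],
    [0.8, 0.7, 0.5],
])
order = order_from_pairwise(P)
print(order)  # [2, 0, 1]
```

Sentence 2 is predicted to precede both others, sentence 0 precedes one, and sentence 1 none, which fixes the order.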

2020

Anaphora and coreference resolution: A review

R Sukthanker, S Poria, E Cambria, R Thirunavukarasu

Information Fusion 2020 303 citations
Entity resolution aims at resolving repeated references to an entity in a document and forms a core component of natural language processing (NLP) research. This field possesses immense potential to improve the performance of other NLP fields like machine translation, sentiment analysis, paraphrase detection, summarization, etc. The area of entity resolution in NLP has seen proliferation of research in two separate sub-areas namely: anaphora resolution and coreference resolution. Through this review article, we aim at clarifying the scope of these two tasks in entity resolution. We also carry out a detailed analysis of the datasets, evaluation metrics and research methods that have been adopted to tackle this NLP problem. This survey is motivated with the aim of providing the reader with a clear understanding of what constitutes this NLP problem and the issues that require attention.

Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research

S Poria, D Hazarika, N Majumder, R Mihalcea

IEEE Transactions on Affective Computing 2020 428 citations
Sentiment analysis as a field has come a long way since it was first introduced as a task nearly 20 years ago. It has widespread commercial applications in various domains like marketing, risk management, market research, and politics, to name a few. Given its saturation in specific subtasks — such as sentiment polarity classification — and datasets, there is an underlying perception that this field has reached its maturity. In this article, we discuss this perception by pointing out the shortcomings and under-explored, yet key aspects of this field necessary to attain true sentiment understanding. We analyze the significant leaps responsible for its current relevance. Further, we attempt to chart a possible course for this field that covers many overlooked and unanswered questions.

CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French

AAB Zadeh, Y Cao, S Hessner, PP Liang, S Poria, LP Morency

EMNLP 2020 103 citations
Modeling multimodal language is a core research area in natural language processing. While languages such as English have relatively large multimodal language resources, other widely spoken languages across the globe have few or no large-scale datasets in this area. This disproportionately affects native speakers of languages other than English. As a step towards building more equitable and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for Spanish, Portuguese, German and French. The proposed dataset, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is the largest of its kind with 40,000 total labelled sentences. It covers a diverse set of topics and speakers, and carries supervision of 20 labels including sentiment (and subjectivity), emotions, and attributes. Our evaluations on a state-of-the-art multimodal model demonstrate that CMU-MOSEAS enables further research for multilingual studies in multimodal language.

Conversational Transfer Learning for Emotion Recognition

D Hazarika, S Poria, R Zimmermann, R Mihalcea

Information Fusion 2020 146 citations
Recognizing emotions in conversations is a challenging task due to the presence of contextual dependencies governed by self- and inter-personal influences. Recent approaches have focused on modeling these dependencies primarily via supervised learning. However, purely supervised strategies demand large amounts of annotated data, which is lacking in most of the available corpora in this task. To tackle this challenge, we look at transfer learning approaches as a viable alternative. Given the large amount of available conversational data, we investigate whether generative conversational models can be leveraged to transfer affective knowledge for detecting emotions in context. We propose an approach, TL-ERC, where we pre-train a hierarchical dialogue model on multi-turn conversations (source) and then transfer its parameters to a conversational emotion classifier (target). In addition to the popular practice of using pre-trained sentence encoders, our approach also incorporates recurrent parameters that model inter-sentential context across the whole conversation. Based on this idea, we perform several experiments across multiple datasets and find improvement in performance and robustness against limited training data. TL-ERC also achieves better validation performances in significantly fewer epochs. Overall, we infer that knowledge acquired from dialogue generators can indeed help recognize emotions in conversations.

COSMIC: Commonsense knowledge for emotion identification in conversations

D Ghosal, N Majumder, A Gelbukh, R Mihalcea, S Poria

EMNLP Findings 2020 518 citations
In this paper, we address the task of utterance-level emotion recognition in conversations using commonsense knowledge. We propose COSMIC, a new framework that incorporates different elements of commonsense such as mental states, events, and causal relations, and builds upon them to learn interactions between interlocutors participating in a conversation. Current state-of-the-art methods often encounter difficulties in context propagation, emotion shift detection, and differentiating between related emotion classes. By learning distinct commonsense representations, COSMIC addresses these challenges and achieves new state-of-the-art results for emotion recognition on four different benchmark conversational datasets. Our code is available at https://github.com/declare-lab/conv-emotion .

Dialogue systems with audio context

T Young, V Pandelea, S Poria, E Cambria

Neurocomputing 2020
Research on building dialogue systems that converse with humans naturally has recently attracted a lot of attention. Most work on this area assumes text-based conversation, where the user message is modeled as a sequence of words in a vocabulary. Real-world human conversation, in contrast, involves other modalities, such as voice, facial expression and body language, which can influence the conversation significantly in certain scenarios. In this work, we explore the impact of incorporating the audio features of the user message into generative dialogue systems. Specifically, we first design an auxiliary response retrieval task for audio representation learning. Then, we use word-level modality fusion to incorporate the audio features as additional context to our main generative model. Experiments show that our audio-augmented model outperforms the audio-free counterpart on perplexity, response diversity and human evaluation.

KinGDOM: Knowledge-Guided DOMain adaptation for sentiment analysis

D Ghosal, D Hazarika, A Roy, N Majumder, R Mihalcea, S Poria

ACL 2020 99 citations
Cross-domain sentiment analysis has received significant attention in recent years, prompted by the need to combat the domain gap between different applications that make use of sentiment analysis. In this paper, we take a novel perspective on this task by exploring the role of external commonsense knowledge. We introduce a new framework, KinGDOM, which utilizes the ConceptNet knowledge graph to enrich the semantics of a document by providing both domain-specific and domain-general background concepts. These concepts are learned by training a graph convolutional autoencoder that leverages inter-domain concepts in a domain-invariant manner. Conditioning a popular domain-adversarial baseline method with these learned concepts helps improve its performance over state-of-the-art approaches, demonstrating the efficacy of our proposed framework.

MIME: MIMicking emotions for empathetic response generation

N Majumder, P Hong, S Peng, J Lu, D Ghosal, A Gelbukh, R Mihalcea, S Poria

EMNLP 2020 318 citations
Current approaches to empathetic response generation view the set of emotions expressed in the input text as a flat structure, where all the emotions are treated uniformly. We argue that empathetic responses often mimic the emotion of the user to a varying degree, depending on its positivity or negativity and content. We show that the consideration of these polarity-based emotion clusters and emotional mimicry results in improved empathy and contextual relevance of the response as compared to the state-of-the-art. Also, we introduce stochasticity into the emotion mixture that yields emotionally more varied empathetic responses than the previous work. We demonstrate the importance of these factors to empathetic response generation using both automatic- and human-based evaluations. The implementation of MIME is publicly available at https://github.com/declare-lab/MIME .

MISA: Modality-invariant and-specific representations for multimodal sentiment analysis

D Hazarika, R Zimmermann, S Poria

ACM MM 2020 1496 citations
Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos. The predominant approach is to develop sophisticated fusion techniques. However, the heterogeneous nature of these signals creates distributional modality gaps that pose significant challenges. In this paper, we aim to learn effective modality representations to aid the process of fusion. We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap. The second subspace is modality-specific, which is private to each modality and captures their characteristic features. These representations provide a holistic view of the multimodal data, which is used for fusion that leads to task predictions. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models. We also consider the task of Multimodal Humor Detection and experiment on the recently proposed UR_FUNNY dataset. Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.

MTAG: Modal-temporal attention graph for unaligned human multimodal language sequences

J Yang, Y Wang, R Yi, Y Zhu, A Rehman, A Zadeh, S Poria, LP Morency

NAACL 2020 176 citations
Human communication is multimodal in nature; it is through multiple modalities such as language, voice, and facial expressions, that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Modal-Temporal Attention Graph (MTAG). MTAG is an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data. We first introduce a procedure to convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions across modalities and through time. Then, a novel graph fusion operation, called MTAG fusion, along with a dynamic pruning and read-out technique, is designed to efficiently process this modal-temporal graph and capture various interactions. By learning to focus only on the important interactions within the graph, MTAG achieves state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks, while utilizing significantly fewer model parameters.

SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis

E Cambria, Y Li, FZ Xing, S Poria, K Kwok

CIKM 2020 613 citations
Deep learning has unlocked new paths towards the emulation of the peculiarly-human capability of learning from examples. While this kind of bottom-up learning works well for tasks such as image classification or object detection, it is not as effective when it comes to natural language processing. Communication is much more than learning a sequence of letters and words: it requires a basic understanding of the world and social norms, cultural awareness, commonsense knowledge, etc.; all things that we mostly learn in a top-down manner. In this work, we integrate top-down and bottom-up learning via an ensemble of symbolic and subsymbolic AI tools, which we apply to the interesting problem of polarity detection from text. In particular, we integrate logical reasoning within deep learning architectures to build a new version of SenticNet, a commonsense knowledge base for sentiment analysis.

Social media marketing and financial forecasting

F Xing, S Poria, E Cambria, R Welsch

Information Processing & Management 2020
The last decade has witnessed a huge development in social media interactions. Today, social media is becoming ubiquitous in two senses: (1) it crosses boundaries and expands to more nations, and (2) it deeply intervenes in our personal life and economic activities. This outburst of social media data has led to an extensive amount of research for analyzing and extracting useful knowledge and workable patterns from social media data (Hussain and Cambria, 2018). Among those, however, the academic interest in social media marketing and financial forecasting has only increased in recent years (Xing et al., 2018b). For example, only one special issue (Hajli and Laroche, 2019) and two workshops (Hahn et al., 2018; Chen et al., 2019) have been organized on these two topics, respectively. Social media marketing aims to communicate with potential customers to build brand loyalty, attract attention, and increase sales (Cambria et al., 2012). As marketing activities migrate to an online interactive environment, companies can listen in on customer feedback as well as actively influence customer behavior (Constantinides, 2009). Social media-based financial forecasting involves applications such as asset allocation strategies, pricing and valuing financial assets, and stock market prediction (Xing et al., 2018a; Malandri et al., 2018). These applications are not static procedures; on the contrary, they continue to collect information and accordingly revise their output. Both social media marketing and financial forecasting require understanding and leveraging the characteristics of social media data. We identify three major characteristics of social media data. First, the volume of social media data is huge. Moreover, marketing and financial forecasting tasks often require real-time analysis of those data, so the processing methods have to be efficient. Second, social media data are becoming multi-modal. Even pure-text data often come with extra dimensions, such as timestamps and location tags; making use of such information is what distinguishes social media research from general NLP research. Third, before the surge of social media data, there were existing information sources supporting marketing and financial forecasting, so methods often have to fuse the knowledge from social media data with that from other sources. In line with the above discussion, this special issue focuses on the presentation of marketing and financial forecasting research that tackles these three characteristics of social media data.

Utterance-level dialogue understanding: An empirical study

D Ghosal, N Majumder, R Mihalcea, S Poria

NAACL 2020
The recent abundance of conversational data on the Web and elsewhere calls for effective NLP systems for dialogue understanding. Complete utterance-level understanding often requires context understanding, defined by nearby utterances. In recent years, a number of approaches have been proposed for various utterance-level dialogue understanding tasks. Most of these approaches account for the context for effective understanding. In this paper, we explore and quantify the role of context for different aspects of a dialogue, namely emotion, intent, and dialogue act identification, using state-of-the-art dialogue understanding methods as baselines. Specifically, we employ various perturbations to distort the context of a given utterance and study its impact on the different tasks and baselines. This provides us with insights into the fundamental contextual controlling factors of different aspects of a dialogue. Such insights can inspire more effective dialogue understanding models, and provide support for future text generation approaches. The implementation pertaining to this work is available at this https URL.

2019

An Attention-Based Model for Learning Dynamic Interaction Networks

S Cavallari, S Poria, E Cambria, VW Zheng, H Cai

IJCNN 2019
In the physical world, complex systems are generally created as the composition of multiple primitive components that interact with each other rather than as a single monolithic structure. Recently, spatio-temporal graphs have received a reasonable amount of attention from the research community, since they emerged as a natural representational tool able to capture the interactive and interrelated structure of a complex problem. To better understand the nature of complex systems, there is a need to define models that can easily explain the learned causal relationships. To this end, we propose an attentive model able to learn and project the relational structure into a fixed-size embedding. Such a representation naturally captures the dynamic influence that each neighbor exerts over a given vertex, providing a valuable description of the problem setting. The proposed architecture has been extensively evaluated against strong baselines on toy as well as real-world tasks, such as prediction of household energy load and traffic congestion.

DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation

D Ghosal, N Majumder, S Poria, N Chhaya, A Gelbukh

EMNLP 2019 1013 citations
Emotion recognition in conversation (ERC) has lately received much attention from researchers due to its potential widespread applications in diverse areas, such as health-care, education, and human resources. In this paper, we present Dialogue Graph Convolutional Network (DialogueGCN), a graph neural network-based approach to ERC. We leverage self- and inter-speaker dependency of the interlocutors to model conversational context for emotion recognition. Through the graph network, DialogueGCN addresses context propagation issues present in the current RNN-based methods and empirically outperforms prior state-of-the-art on several benchmark datasets.

DialogueRNN: An attentive RNN for emotion detection in conversations

N Majumder, S Poria, D Hazarika, R Mihalcea, A Gelbukh, E Cambria

AAAI 2019 1206 citations
Emotion detection in conversations is a necessary step for a number of applications, including opinion mining over chat history, social media threads, debates, argumentation mining, understanding consumer feedback in live conversations, and so on. Current systems do not treat the parties in the conversation individually by adapting to the speaker of each utterance. In this paper, we describe a new method based on recurrent neural networks that keeps track of the individual party states throughout the conversation and uses this information for emotion classification. Our model outperforms the state-of-the-art by a significant margin on two different datasets.

Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances

S Poria, N Majumder, R Mihalcea, E Hovy

IEEE Access 2019 660 citations
Emotion is intrinsic to humans and consequently, emotion understanding is a key part of human-like artificial intelligence (AI). Emotion recognition in conversation (ERC) is becoming increasingly popular as a new research frontier in natural language processing (NLP) due to its ability to mine opinions from the plethora of publicly available conversational data on platforms such as Facebook, YouTube, Reddit, Twitter, and others. Moreover, it has potential applications in health-care systems (as a tool for psychological analysis), education (understanding student frustration), and more. In addition, ERC is also extremely important for generating emotion-aware dialogues that require an understanding of the user’s emotions. Catering to these needs calls for effective and scalable conversational emotion-recognition algorithms. However, it is a difficult problem to solve because of several research challenges. In this paper, we discuss these challenges and shed light on recent research in this field. We also describe the drawbacks of these approaches and discuss the reasons why they fail to successfully overcome the research challenges in ERC.

Factorized multimodal transformer for multimodal sequential learning

A Zadeh, C Mao, K Shi, Y Zhang, PP Liang, S Poria, LP Morency

arXiv preprint arXiv:1911.09826 2019 67 citations
The complex world around us is inherently multimodal and sequential (continuous). Information is scattered across different modalities and requires multiple continuous sensors to be captured. As machine learning leaps towards better generalization to the real world, multimodal sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed spatio-temporal dynamics within and across modalities is the biggest challenge in this research area. In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT), for multimodal sequential learning. FMT inherently models the intramodal and intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized manner. The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand, without encountering difficulties during training (e.g. overfitting) even on relatively low-resource setups. All the attention mechanisms within FMT have a full time-domain receptive field which allows them to asynchronously capture long-range multimodal dynamics. In our experiments we focus on datasets that contain the three commonly studied modalities of language, vision and acoustic. We perform a wide range of experiments, spanning across 3 well-studied datasets and 21 distinct labels. FMT shows superior performance over previously proposed models, setting new state of the art in the studied datasets.

MELD: A multimodal multi-party dataset for emotion recognition in conversations

S Poria, D Hazarika, N Majumder, G Naik, E Cambria, R Mihalcea

ACL 2019 2205 citations
Emotion recognition in conversations is a challenging task that has recently gained popularity due to its potential applications. Until now, however, a large-scale multimodal multi-party emotional conversational database containing more than two speakers per dialogue was missing. Thus, we propose the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines. MELD contains about 13,000 utterances from 1,433 dialogues from the TV-series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for emotion recognition in conversations. The full dataset is available at http://affective-meld.github.io.

Multi-task Learning for Multi-Modal Emotion Recognition and Sentiment Analysis

MS Akhtar, DS Chauhan, D Ghosal, S Poria, A Ekbal, P Bhattacharyya

NAACL 2019 298 citations
Related tasks often have inter-dependence on each other and perform better when solved in a joint framework. In this paper, we present a deep multi-task learning framework that jointly performs both sentiment and emotion analysis. The multi-modal inputs (i.e. text, acoustic and visual frames) of a video convey diverse and distinctive information, and usually do not have equal contribution in the decision making. We propose a context-level inter-modal attention framework for simultaneously predicting the sentiment and expressed emotions of an utterance. We evaluate our proposed approach on the CMU-MOSEI dataset for multi-modal sentiment and emotion analysis. Evaluation results suggest that the multi-task learning framework offers an improvement over the single-task framework. The proposed approach reports new state-of-the-art performance for both sentiment analysis and emotion analysis.

Sentiment and Sarcasm Classification with Multitask Learning

N Majumder, S Poria, H Peng, N Chhaya, E Cambria, A Gelbukh

IEEE Intelligent Systems 2019 283 citations
Sentiment classification and sarcasm detection are both important natural language processing tasks. Sentiment is always coupled with sarcasm where intensive emotion is expressed. Nevertheless, most literature considers them as two separate tasks. We argue that knowledge in sarcasm detection can also be beneficial to sentiment classification and vice versa. We show that these two tasks are correlated, and present a multitask learning-based framework using a deep neural network that models this correlation to improve the performance of both tasks in a multitask learning setting. Our method outperforms the state of the art by 3–4% in the benchmark dataset.

The Nitty-GRITties of Success: Computational Analysis of Grit From Language

T Maheshwari, A Reganti, S Poria, R Mihalcea

IEEE Access 2019
Grit is a prominent personality trait that measures an individual's passion and perseverance for long-term goals. Grit entails that zeal and persistence of motive play a key role in determining an individual's success in the long run, as opposed to natural talent. But how does one identify and distinguish between gritty and non-gritty individuals? Do they use language differently? In this paper, we seek to answer these questions using a social media setting. We build a new crowd-sourced Twitter corpus that contains the posts of 464 users along with their grit scores, and explore how grit correlates with other major behavioural and personality traits. We then train machine learning classification models to predict the grit level of an individual using his/her Twitter posts, and show that language can be effectively used to infer this personality trait.

Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper)

S Castro, D Hazarika, V Pérez-Rosas, R Zimmermann, R Mihalcea, S Poria

ACL 2019 491 citations
Sarcasm is often expressed through several verbal and non-verbal cues, e.g., a change of tone, overemphasis in a word, a drawn-out syllable, or a straight-looking face. Most of the recent work in sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm. As a first step towards enabling the development of multimodal approaches for sarcasm detection, we propose a new sarcasm dataset, Multimodal Sarcasm Detection Dataset (MUStARD), compiled from popular TV shows. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context of historical utterances in the dialogue, which provides additional information on the scenario where the utterance occurs. Our initial results show that the use of multimodal information can reduce the relative error rate of sarcasm detection by up to 12.9% in F-score when compared to the use of individual modalities. The full dataset is publicly available for use at https://github.com/soujanyaporia/MUStARD.