Publications
DeCLaRe Lab was founded in 2019, so this archive focuses on lab publications from 2019 onward. Use keyword search, citation status, PDF availability, or category filters. Links are visible by default; abstracts expand inline when available.
2026
10 Open Challenges Steering the Future of Vision-Language-Action Models
Due to their ability to follow natural language instructions, vision-language-action (VLA) models
are increasingly prevalent in the embodied AI arena. In this paper, we discuss 10 principal
milestones in the ongoing development of VLA models—multimodality, reasoning, data, evaluation,
cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and
coordination with humans. We also discuss emerging trends such as spatial understanding, modeling
world dynamics, post-training, and data synthesis, aiming to accelerate development of VLA models.
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Video understanding requires identifying and reasoning over semantically discriminative visual
objects across frames, yet existing object-agnostic solutions struggle to effectively handle
substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a
search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning
step to specific visual evidence regions, enabling compositional and multi-step decision-making.
Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally
builds spatially grounded traces around task-relevant visual objects, thereby mitigating
over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided
controller, optimized via reinforcement learning with a format reward that incentivizes grounding
capability, to iteratively ground visual evidence regions and form reliable
reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive
evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning,
and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of
Chain-of-Glimpse across diverse video reasoning tasks.
Data Agent: Learning to Select Data via End-to-End Dynamic Optimization
Dynamic data selection aims to accelerate training by prioritizing informative samples during
online training. However, existing methods typically rely on task-specific handcrafted metrics or
static/snapshot-based criteria to estimate sample importance, limiting scalability across learning
paradigms and making it difficult to capture the evolving utility of data throughout training. To
address this challenge, we propose Data Agent, an end-to-end dynamic data selection framework that
formulates data selection as a training-aware sequential decision-making problem. The agent learns a
sample-wise selection policy that co-evolves with model optimization, guided by a composite reward
that integrates loss-based difficulty and confidence-based uncertainty signals. The reward signals
capture complementary objectives of optimization impact and information gain, together with a
tuning-free adaptive weighting mechanism that balances these signals over training. Extensive
experiments across a wide range of datasets and architectures demonstrate that Data Agent
consistently accelerates training while preserving or improving performance, e.g., reducing costs by
over 50% on ImageNet-1k and MMLU without degrading accuracy. Moreover, its dataset-agnostic
formulation and modular reward make it plug-and-play across tasks and scenarios, e.g., robustness to
noisy datasets, highlighting its potential in real-world scenarios.
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop
deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks
leak the reasoning path in the question text, allowing models to follow surface cues rather than
discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass
rate, which collapses diverse behaviours into one score and obscures whether failures stem from
inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present
WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia
sandbox that ensures full traceability of model actions, and a holistic evaluation framework that
separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25
state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with
knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate
refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at
executing given reasoning paths but fail when required to discover them. We develop an agentic
workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies,
incorporating verification loops and systematic evidence tracking that improve both search and
synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can
guide concrete architectural improvements, establishing our benchmark as a critical tool for
developing genuinely autonomous reasoning systems rather than pattern-following agents.
Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems
Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to
misleading peers. We show this weakness stems from both sycophancy and inadequate ability to
evaluate peer reliability. To address this, we first formalize the learning problem of history-aware
reference, introducing the historical interactions of peers as additional input, so that agents can
estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from
evaluating peer reasoning quality to estimating peer reliability based on interaction history. We
then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on
explicitly-built peer profiles from history. We further optimize ECL by reinforcement learning using
auxiliary rewards. Our experiments reveal that ECL enables small models like Qwen 3-4B to
outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable
peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL
generalizes well to various MA configurations, and we find that LLMs model trust well, revealing a
strong correlation between trust-modeling accuracy and final answer quality.
From Perception to Action: An Interactive Benchmark for Vision Reasoning
Understanding physical structure is essential for real-world applications such as embodied
agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model
(VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to
assess agents' ability to reason about how geometry, contact, and support relations jointly
constrain what actions are possible in a dynamic environment. To address this gap, we introduce the
Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven
testbed designed to evaluate whether models can understand, plan, and execute structured action
sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to
active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and
packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under
unified interactive settings. Our results show that top-performing models still struggle to
internalize physical structure and causal constraints, often failing to produce reliable
long-horizon plans and cannot robustly translate perceived structure into effective actions. The
project is available at https://social-ai-studio.github.io/CHAIN/.
LLM Alignment should go beyond Harmlessness–Helpfulness and incorporate Human Agency
Large Language Models are transforming communication, research, and decision-making, but
misalignment – when models diverge from human values, safety requirements, or user intent – poses
serious risks. In this position paper, we argue that many alignment failures stem from operational
choices in training and deployment. We posit that alignment should shift from static, post-training
constraints toward dynamic, participatory approaches that safeguard pluralism, autonomy, and human
flourishing. We outline forward-looking directions, including pluralistic evaluation, transparency,
and the Flourishing–Justice–Autonomy (FJA) framework, and present a roadmap for advancing alignment
research and practice.
LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions
Large language models (LLMs) are increasingly integrated into multi-agent systems (MAS), where peer
interactions shape individual decisions. While prior work has mainly examined conformity bias, we
broaden the view to include how LLMs build rapport from prior interactions, discern and integrate
high-quality peer information, and resist misleading inputs: abilities essential for achieving
collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark that
simulates quiz-style collaboration with peer agents whose rapport levels and behaviours can be
precisely controlled in both historical interactions and the current round. This unified setup
enables systematic analysis of how rapport, peer actions, and the model's self-confidence jointly
influence decision-making. Using KAIROS, we evaluate prompting, supervised fine-tuning, and
reinforcement learning via Group Relative Policy Optimisation (GRPO). Results show that model scale
is a primary factor moderating susceptibility to social influence: larger models are more resilient
and benefit from prompting-based mitigation, whereas smaller models remain vulnerable. Only
carefully configured GRPO training yields consistent robustness and performance gains for small
models.
OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale
deployment. While most studies and global discussions focus on generic harms, such as models
assisting users in harming themselves or others, enterprises face a more fundamental concern:
whether LLM-based agents are safe for their intended use case. To address this, we introduce
operational safety, defined as an LLM's ability to appropriately accept or refuse user queries when
tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark
for measuring operational safety both in general and within specific agentic use cases. Our
evaluations on six model families comprising 20 open-weight LLMs reveal that while performance
varies across models, all of them remain highly operationally unsafe. Even the strongest models,
Qwen-3 (235B) at 77.77% and Mistral (24B) at 79.96%, fall far short of reliable operational
safety, while GPT models plateau in the 62-73% range, Phi achieves only mid-level scores (48-70%),
and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is a
core model alignment issue, to suppress these failures, we propose prompt-based steering methods:
query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD
refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger
boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the
urgent need for operational safety interventions and the promise of prompt-based steering as a first
step toward more reliable LLM-based agents.
Stacked from One: Multi-Scale Self-Injection for Context Window Extension
The limited context window of contemporary large language models (LLMs) remains a primary
bottleneck for their broader application across diverse domains. Although continual pre-training on
long-context data offers a straightforward solution, it incurs prohibitive data acquisition and
computational costs. To address this challenge, we propose SharedLLM, a novel framework based on
multi-grained context compression and query-aware information acquisition. SharedLLM comprises two
stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a
decoder. The lower model compresses long inputs into compact, multi-grained representations, which
are then forwarded to the upper model for context-aware processing. To maximize efficiency, this
information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and
redundant cross-attention operations. This entire process, wherein the upper and lower models are
derived from the same underlying LLM layers, is termed self-injection. To support this
architecture, a specialized tree-based data structure enables the efficient encoding and query-aware
retrieval of contextual information. Despite being trained on sequences of only 8K tokens,
SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of
long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or
comparable to strong baselines, striking an optimal balance between efficiency and accuracy.
Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and
yield notable inference speedups (2× over streaming and 3× over encoder-decoder architectures).
Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization
★
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters,
capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A
key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA
lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large
Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a
novel framework that iteratively generates and optimizes preference data to enhance TTA alignment.
We demonstrate that the audio preference dataset generated using CRPO outperforms existing
alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both
objective and subjective benchmarks. We open source all code and models to support further research
in TTA generation.
Tracking the Evolution of Multimodal Reasoning on Visual Puzzles
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm
shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have
demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for
Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns,
whereas humans often perceive and reason about multimodal scenarios involving both vision and
language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in
multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models and
compare them against leading open-source alternatives on challenging multimodal puzzles from
PuzzleVQA and AlgoPuzzleVQA. Our results reveal that the o-[n] series, particularly later
iterations, significantly outperform both the GPT-[n] series and the evaluated open-source models,
establishing clear performance tiers. Nonetheless, despite these substantial advancements, our
findings highlight that even leading models face persistent challenges in perception and
compositional reasoning.
2025
Action-guided prompt tuning for video grounding
Video grounding aims to locate a moment-of-interest semantically corresponding to a given query. We
claim that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the
human perception process of events. We propose Action-Guided Prompt Tuning (AGPT), which includes a
Prompt Exploration module to expand salient verb representations and an auxiliary action temporal
prediction task with a temporal rank loss to simulate human perceptual segmentation. AGPT integrates
seamlessly into existing models with minimal overhead and improves performance on video grounding
benchmarks.
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning
This paper introduces the novel task of multimodal puzzle solving, framed within the context of
visual question-answering. We present a new dataset, AlgoPuzzleVQA, designed to challenge and
evaluate the capabilities of multimodal language models in solving algorithmic puzzles that
necessitate visual understanding, language understanding, and complex algorithmic reasoning.
The dataset covers topics such as boolean logic, combinatorics, graph theory, optimization and
search, ensuring exact solutions are derivable algorithmically. Our investigation reveals that large
language models such as GPT4V and Gemini exhibit limited performance on these puzzles, often near
random in a multi-choice setup, highlighting the challenges of integrating visual, language, and
algorithmic knowledge.
Content extraction based on hop distance within a graph model
A method of categorizing text entries on a document can include determining, for each of a
plurality of text bounding boxes in the document, respective text, respective coordinates, and
respective input embeddings. The method may further include defining a graph of the plurality of
bounding boxes, the graph comprising a plurality of connections among the plurality of bounding
boxes, each connection comprising a first and second bounding box and zero or more respective
intermediate bounding boxes. The method may further include determining a respective attention value
for each connection according to a quantity of intermediate bounding boxes in the connection and,
based on the respective attention values and a transformer-based machine learning model applied to
the respective input embeddings and respective coordinates, determining output embeddings for each
bounding box and, based on the respective output embeddings, generating a bounding box label for
each bounding box.
DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors
Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive,
goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which
leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and
employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to
select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors
each decision to advance the task while nurturing a genuine, empathetic connection. Across
negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3
turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while
markedly improving negotiation outcomes.
DiffPO: Diffusion-styled Preference Optimization for Inference Time Alignment of Large Language Models
Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However,
these approaches still face challenges, such as limited scalability due to policy-specific value
functions and latency during the inference phase. In this paper, we propose a novel approach,
Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic
solution for aligning LLMs with humans. By directly performing alignment at sentence level, DiffPO
avoids the time latency associated with token-level generation. Designed as a plug-and-play module,
DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive
experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior
alignment performance across various settings, achieving a favorable trade-off between alignment
quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability,
significantly improving the performance of large models such as Llama-3-70B.
Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning
Traditional reinforcement learning-based robotic control methods are often task-specific and fail
to generalize across diverse environments or unseen objects and instructions. Visual Language Models
(VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to
generate actionable policies tailored to specific robotic embodiments. To address this,
Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial
reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model
with Grounded Chain of Thought and Look-ahead Spatial Reasoning, EMMA-X. EMMA-X leverages our
constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation
trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we
introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which
can help mitigate hallucination in grounding subtask reasoning generation. Experimental results
demonstrate that EMMA-X achieves superior performance over competitive baselines, particularly in
real-world robotic tasks requiring spatial reasoning.
Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision
Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and
reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify
only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation
toward coherent solutions. We introduce PathFinder-PRM, a hierarchical, error-aware discriminative
PRM that first classifies math and consistency errors at each step, then combines these fine-grained
signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by
enriching existing corpora with step-level labels. On PRMBench, PathFinder-PRM achieves
state-of-the-art PRMScore while using substantially less data, and improves reward-guided search
performance.
Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck
in long-context language models employing softmax attention. We introduce Error-Free Linear
Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta
rule. Specifically, we formulate the online learning update as a continuous-time dynamical system
and prove that its exact solution is not only attainable but also computable in linear time with
full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the
exact closed-form solution. This attention mechanism is theoretically free from error
accumulation, perfectly capturing the continuous dynamics while preserving linear-time complexity.
Through experiments, we show EFLA enables robust performance in noisy environments and superior
downstream benchmark performance.
Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk Appetite?
We assess whether AI systems can credibly evaluate investment risk appetite—a task that must be
thoroughly validated before automation. Our analysis was conducted on proprietary systems (GPT,
Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral), using carefully curated user
profiles that reflect real users with varying attributes such as country and gender. We find that
the models exhibit significant variance in score distributions when user attributes that should not
influence risk computation, such as country or gender, are changed. For example, GPT-4o assigns higher
risk scores to Nigerian and Indonesian profiles. While some models align closely with expected
scores in the low- and mid-risk ranges, none maintain consistent scores across regions and
demographics, thereby violating AI and finance regulations.
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
Recent advancements in Large Language Models (LLMs) have showcased striking results on existing
logical reasoning benchmarks, with some models even surpassing human performance. However, the true
depth of their competencies and robustness in reasoning tasks remains an open question. To this end,
in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation.
Particularly, we introduce (i) a general ontology of perturbations for math and coding questions,
(ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, GSMORE and
HUMANEVAL-CORE, respectively, of perturbed math and coding problems to probe LLM capabilities in
numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and
open-source LLMs, we show a significant performance drop across all the models against the perturbed
questions, suggesting that the current LLMs lack robust problem solving skills and structured
reasoning abilities in many areas, as defined by our ontology. We open-source the datasets and
source codes at: https://github.com/declare-lab/LLM-ReasoningTest.
Ferret: Faster and effective automated red teaming with reward-based scoring technique
Automated red-teaming generates adversarial attacks to identify vulnerabilities in LLMs but can be
slow and resource-intensive. We propose Ferret, which enhances baseline Rainbow Teaming by producing
multiple adversarial prompt mutations per iteration and ranking them with scoring functions (reward
models, Llama Guard, LLM-as-a-judge). Ferret achieves higher attack success rates, greater
transferability across models, and faster time-to-target, with code available at
https://github.com/declare-lab/ferret.
From grounding to manipulation: Case studies of foundation model integration in embodied robotic systems
Foundation models (FMs) are increasingly applied to bridge language and action in embodied agents,
yet the operational characteristics of different integration strategies remain under-explored. We
investigate three paradigms for robotic systems: end-to-end vision-language-action models (VLAs),
modular pipelines using vision-language models (VLMs), and multimodal large language models (MLLMs).
Case studies on instruction grounding and object manipulation reveal trade-offs in scale,
generalization and data efficiency, providing design lessons for language-driven physical agents and
identifying opportunities and challenges for FM-powered robotics in real-world settings.
Harnessing large language models for scientific novelty detection
In an era of exponential scientific growth, identifying novel research ideas is crucial and
challenging in academia. Despite its potential, the lack of an appropriate benchmark dataset hinders
research on novelty detection. More importantly, simply adopting existing NLP technologies, e.g.,
retrieving and then cross-checking, is not a one-size-fits-all solution due to the gap between
textual similarity and idea conception. In this paper, we propose to harness large language models
(LLMs) for scientific novelty detection (ND), associated with two new datasets in marketing and NLP
domains. To construct suitable datasets for ND, we propose to extract closure sets of papers
based on their relationship, and then summarize their main ideas based on LLMs. To capture idea
conception, we propose to train a lightweight retriever by distilling the idea-level knowledge from
LLMs to align ideas with similar conception, enabling efficient and accurate idea retrieval for LLM
novelty detection. Experiments show our method consistently outperforms others on the proposed
benchmark datasets for idea retrieval and ND tasks. Codes and data are available at
https://anonymous.4open.science/r/NoveltyDetection-10FB/.
JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent
times. These models are increasingly capable of generating high-quality and faithful audio outputs
capturing speech and acoustic events. However, there is still much room for improvement in
creative audio generation that primarily involves music and songs. Recent open lyrics-to-song
models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song
generation for recreational use. However, these models lack fine-grained word-level controllability
often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based
JAM is the first effort toward endowing word-level timing and duration control in song generation,
allowing fine-grained vocal control. To enhance the quality of generated songs to better align with
human preferences, we implement aesthetic alignment through Direct Preference Optimization, which
iteratively refines the model using a synthetic dataset, eliminating the need for manual data
annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through
our public evaluation dataset JAME. We show that JAM outperforms existing models in terms of
music-specific attributes.
Lessons from Training Grounded LLMs with Verifiable Rewards
Generating grounded and trustworthy responses remains a key challenge for large language models
(LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise,
instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly
stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore
how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the
GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based
rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring
gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA,
QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform
instruction-only variants, especially in handling unanswerable queries and generating well-cited
responses. A two-stage training setup, first optimizing answer and citation behavior and then
refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit
instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance
on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning,
stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
Libra-leaderboard: Towards responsible AI through a balanced leaderboard of safety and capability
As large language models (LLMs) continue to evolve, leaderboards play a significant role in
steering their development. Existing leaderboards often prioritize model capabilities while
overlooking safety concerns, leaving a significant gap in responsible AI development. To address
this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a
balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive
LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike
traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a
distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes
models to achieve a balance rather than excelling in one dimension at the expense of others. In
the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading
organizations, identifying critical safety challenges even in state-of-the-art models.
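The difference between averaging and a distance-to-optimal score can be sketched in a few lines. This is an illustrative reconstruction that assumes Euclidean distance to an ideal point of (100, 100); the exact metric used by Libra-Leaderboard may differ.

```python
import math

# Illustrative distance-to-optimal-score ranking: the further a model sits
# from the ideal (capability=100, safety=100) corner, the lower its score.
# The Euclidean form is an assumption for illustration.

def libra_score(capability, safety):
    """Higher is better; penalizes imbalance more than a plain average."""
    d = math.dist((capability, safety), (100.0, 100.0))
    d_max = math.dist((0.0, 0.0), (100.0, 100.0))
    return 100.0 * (1.0 - d / d_max)

balanced = libra_score(70, 70)   # balanced model
lopsided = libra_score(95, 45)   # same arithmetic mean, one weak dimension
```

Both toy models average 70, but the balanced one scores higher, which is exactly the incentive the leaderboard aims for.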
M-longdoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework
The ability to understand and answer questions over documents can be useful in many business and
practical applications. However, documents often contain lengthy and diverse multimodal contents
such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly.
Hence, there is an urgent need to develop effective and automated methods to aid humans in this
task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework
to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning
approach for efficient and effective multimodal document reading. Compared to existing works, our
benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring
open-ended explanations and not just extractive answers. To our knowledge, our training framework is
the first to directly address the retrieval setting for multimodal long documents. To enhance open
models, we construct a training corpus in a fully automatic manner. Experiments show that our tuning
approach significantly improves the correctness of model responses by 4.6%.
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
LLMs are an integral component of retrieval-augmented generation (RAG) systems. While many studies
focus on evaluating the overall quality of end-to-end RAG systems, there is a gap in understanding
the appropriateness of LLMs for the RAG task. To address this, we introduce Trust-Score, a holistic
metric that evaluates the trustworthiness of LLMs within the RAG framework. Our results show that
various prompting methods, such as in-context learning, fail to effectively adapt LLMs to the RAG
task as measured by Trust-Score. Consequently, we propose Trust-Align, a method to align LLMs for
improved Trust-Score performance. 26 out of 27 models aligned using Trust-Align substantially
outperform competitive baselines on ASQA, QAMPARI, and ELI5. Specifically, with LLaMA-3-8b,
Trust-Align outperforms FRONT on ASQA (up 12.56), QAMPARI (up 36.04), and ELI5 (up 17.69).
Trust-Align also significantly enhances models' ability to correctly refuse and provide quality
citations. We also demonstrate the effectiveness of Trust-Align across different open-weight models,
including the LLaMA series (1b to 8b), Qwen-2.5 series (0.5b to 7b), and Phi3.5 (3.8b). We release
our code at https://github.com/declare-lab/trust-align.
MM-InstructEval: Zero-shot evaluation of (Multimodal) Large Language Models on multimodal reasoning tasks
★
The emergence of multimodal large language models (MLLMs) has triggered extensive research in model
evaluation. While existing evaluation studies primarily focus on unimodal (vision-only)
comprehension and reasoning capabilities, they overlook critical assessments of complex multimodal
reasoning tasks that require integrated understanding of both visual and textual contexts. Such
multimodal tasks present unique challenges, demanding sophisticated reasoning across multiple
modalities and deep comprehension of multimodal contexts. In this paper, we present MM-InstructEval,
a comprehensive evaluation framework that incorporates diverse metrics to assess model performance
across various multimodal reasoning tasks with vision-text contexts. We conduct extensive zero-shot
evaluations on 45 models (including 36 MLLMs) across 16 multimodal datasets, encompassing 6 distinct
tasks using 10 different instructions. Our framework introduces multiple innovative metrics,
including the 'Best Performance' metric to benchmark peak model capabilities, the 'Mean Relative
Gain' metric to assess overall efficacy across models and instructions, the 'Stability' metric to
measure robustness, and the 'Adaptability' metric to quantify the compatibility between models and
instructions. Through comprehensive evaluation and analysis, we uncover several significant insights
about model architectures, instruction formats, and their interactions in multimodal reasoning
tasks. Our findings establish new benchmarks for assessing the reasoning capabilities of MLLMs and
provide strategic guidance for future developments. To facilitate continued research and evaluation
in this field, we release our framework and resources at
https://github.com/declare-lab/MM-InstructEval, with an interactive leaderboard available at
MM-InstructEval Leaderboard (https://declare-lab.github.io/MM-InstructEval/).
MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
★
We investigate whether LLMs can automatically discover novel and valid chemistry research
hypotheses given only a research question. Using a benchmark of 51 chemistry papers, we break the
task into retrieving inspirations and generating hypotheses. We develop an LLM-based multi-agent
framework and show the proposed method can rediscover many hypotheses with high similarity to ground
truth.
Multimodal analysis combining monitoring modalities to elicit cognitive states and perform screening for mental disorders
Embodiments provide techniques for mental health screening using multimodal analysis (video, audio,
speech). Features are extracted per modality and fused (late fusion) to classify fused features
using trained models for detection of disorders such as depression, anxiety, and PTSD. The system
can generate a disorder-state representation and be integrated into telehealth platforms for
screening and monitoring.
Nora-1.5: A vision-language-action model trained using world-model- and action-based preference rewards
Vision-language-action (VLA) models have recently shown promising performance on a variety of
embodied tasks, yet they still fall short in reliability and generalization, especially when
deployed across different embodiments or real-world environments. In this work, we introduce
NORA-1.5, a VLA model that augments the pre-trained NORA backbone with a flow-matching-based
action expert. This architectural enhancement alone yields substantial performance gains, enabling
NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and
real-world benchmarks. To further improve robustness and task success, we develop a set of reward
models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model
(WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a
deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these
reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through
direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training
consistently improves performance in both simulation and real-robot settings, demonstrating
significant VLA model-reliability gains through simple yet effective reward models. Our findings
highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied
agents suitable for real-world deployment.
Nora: A small open-sourced generalist vision language action model for embodied tasks
★
Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot
scenarios, demonstrating impressive task execution and reasoning capabilities. However, a
significant challenge arises from the limitations of visual encoding, which can result in failures
during tasks such as object grasping. Moreover, these models typically incur high computational
overhead due to their large sizes, often exceeding 7B parameters. While they excel in reasoning and
task planning, this overhead makes them impractical for real-time robotic environments, where speed
and efficiency are paramount. To address
the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce
computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B
multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance
visual reasoning and action grounding. Additionally, NORA is trained on 970k real-world
robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation.
Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving
better task performance with significantly reduced computational overhead, making it a more
practical solution for real-time robotic autonomy.
Not all votes count! Translated program for verification improves self-consistency of language models for math reasoning
Large language models (LLMs) have become increasingly capable of solving mathematical reasoning
problems. However, many open-source LLMs still encounter issues with calculation errors and semantic
misunderstandings during intermediate reasoning steps. In this work, we present Prove, a simple yet
effective framework that leverages translated Python programs derived from natural language
solutions as a verification mechanism. This verification mechanism helps identify and filter out
potentially incorrect paths before final answers are aggregated. Unlike basic majority voting, our
approach rejects solutions whose program outputs do not align with the generated solution, only
aggregating those that pass the verification step. We conducted extensive experiments with 13
open-source LLMs of various model sizes across eight mathematical benchmarks, demonstrating
consistent improvements over baselines.
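The verify-then-vote idea in Prove can be sketched as follows. This is a minimal illustration of the aggregation step only; the translation of natural-language solutions into Python programs is assumed to have already happened, and the fallback behavior is an assumption.

```python
from collections import Counter

# Sketch of Prove-style verified voting: each candidate solution carries a
# final answer plus the output of its translated Python program; only
# candidates whose program output agrees with their stated answer survive
# to the majority vote. The fallback rule is an illustrative assumption.

def verified_majority_vote(candidates):
    """candidates: list of (final_answer, program_output) pairs."""
    verified = [ans for ans, prog_out in candidates if ans == prog_out]
    if not verified:
        # Nothing passed verification: fall back to plain majority voting.
        verified = [ans for ans, _ in candidates]
    return Counter(verified).most_common(1)[0][0]
```

In a case where plain majority voting ties (two votes each for "12" and "15"), filtering out candidates whose programs disagree with their own answers can break the tie in favor of the verified path.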
Pixel-level reasoning segmentation via multi-turn conversations
Existing visual perception systems focus on region-level segmentation in single-turn dialogues,
relying on complex and explicit query instructions. Such systems cannot reason at the pixel level
or comprehend dynamic user intent that evolves over the interaction. Our work tackles this issue by
introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn
conversations, tracking evolving user intent via multi-turn interactions for fine-grained
segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng
Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose
MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework that integrates pixel-level
segmentation with robust multi-turn conversation understanding, generating pixel-grounded
explanations aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in
pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our
method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based
reasoning metrics. The code and data are available at: https://anonymous.4open.science/r/PixelRS/.
PREMISE: Matching-based Prediction for Accurate Review Recommendation
We present PREMISE, a new architecture for matching-based learning in multimodal fields, applied
to the MRHP task. Unlike previous fusion-based methods, which obtain multimodal representations
via cross-modal attention for downstream tasks, PREMISE computes multi-scale and multi-field
representations, filters duplicated semantics, and then obtains a set of matching scores as feature
vectors for the downstream recommendation task. This new architecture significantly boosts
performance on multimodal tasks whose context-matching content is highly correlated with the
targets of the task, compared to state-of-the-art fusion-based methods. Experimental results on
two publicly available datasets show that PREMISE achieves promising performance with less
computational cost.
PROEMO: Prompt-driven text-to-speech synthesis based on emotion and intensity control
Speech synthesis has significantly advanced from statistical methods to deep neural network
architectures, leading to various text-to-speech (TTS) models that closely mimic human speech
patterns. However, capturing nuances such as emotion and style in speech synthesis is challenging.
To address this challenge, we introduce an approach centered on prompt-based emotion control. The
proposed architecture incorporates emotion and intensity control across multi-speakers. Furthermore,
we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic
content. By embedding emotional cues, regulating intensity levels, and guiding prosodic
variations with prompts, our approach infuses synthesized speech with human-like expressiveness and
variability. Lastly, we demonstrate the effectiveness of our approach through a systematic
exploration of the control mechanisms mentioned above.
PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference
As large language models (LLMs) tackle increasingly complex tasks and longer documents, their
computational and memory costs during inference become a major bottleneck. To address this, we
propose PromptDistill, a novel, training-free method that improves inference efficiency while
preserving generation quality. PromptDistill identifies and retains the most informative tokens by
leveraging attention interactions in early layers, preserving their hidden states while reducing the
computational burden in later layers. This allows the model to focus on essential contextual
information without fully processing all tokens. Unlike previous methods such as H2O and SnapKV,
which perform compression only after processing the entire input, or GemFilter, which selects a
fixed portion of the initial prompt without considering contextual dependencies, PromptDistill
dynamically allocates computational resources to the most relevant tokens while maintaining a global
awareness of the input. Experiments using our method and baseline approaches with base models such
as LLaMA 3.1 8B Instruct, Phi 3.5 Mini Instruct, and Qwen2 7B Instruct on benchmarks including
LongBench, InfBench, and Needle in a Haystack demonstrate that PromptDistill significantly improves
efficiency while having minimal impact on output quality compared to the original models. With a
single-stage selection strategy, PromptDistill effectively balances performance and efficiency,
outperforming prior methods like GemFilter, H2O, and SnapKV due to its superior ability to retain
essential information. Specifically, compared to GemFilter, PromptDistill achieves an overall 1%
to 5% performance improvement while also offering better time efficiency. Additionally, we
explore multi-stage selection, which further improves efficiency while maintaining strong generation
performance.
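The core selection step, scoring context tokens by early-layer attention and retaining only the top-k hidden states, can be sketched on toy arrays. The selection layer, the use of the last token as the query, and k are illustrative hyperparameters here, not the paper's settings.

```python
import numpy as np

# Toy sketch of query-guided token retention: at an early layer, score each
# context token by the attention it receives from the final (query) position,
# then keep only the top-k hidden states for the later layers.

def retain_tokens(hidden, attn_from_query, k):
    """hidden: (seq, dim) hidden states at the selection layer.
    attn_from_query: (seq,) attention weights from the last token.
    Returns retained hidden states and their original positions."""
    keep = np.sort(np.argsort(attn_from_query)[-k:])  # top-k, original order
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))
attn = np.array([0.01, 0.30, 0.02, 0.25, 0.05, 0.02, 0.05, 0.30])
kept_hidden, kept_pos = retain_tokens(hidden, attn, k=3)
```

Later layers then operate on 3 tokens instead of 8, which is where the inference savings come from; preserving the original token order keeps positional information intact.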
Reward-Guided Tree Search for Inference Time Alignment of Large Language Models
Inference-time computation methods enhance the performance of Large Language Models (LLMs) by
leveraging additional computational resources to achieve superior results. Common techniques, such
as Best-of-N sampling, Majority Voting, and variants of tree-search algorithms, have proven
effective in boosting the performance of LLMs. These approaches strategically trade increased
computational resources for improved model responses. In this work, we propose DARWIN, an
inference-time alignment method that leverages the guidance of a reward model to achieve alignment
through reward-guided tree search. Empirical evidence indicates that our method outperforms other
inference-time alignment methods such as Best-of-N and ARGS on two widely accepted alignment
benchmarks, AlpacaEval 2 and MT-Bench. Furthermore, we show that our inference-time approach achieves
performance comparable to preference-tuned models on both benchmarks, highlighting the effectiveness
of trading inference-time compute for enhanced performance during inference.
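The general shape of reward-guided tree search can be sketched as a beam search over partial responses scored by a reward model. This is a generic schematic, not DARWIN itself: `expand` and `reward` are hypothetical stand-ins for an LLM's continuation sampler and a learned reward model.

```python
# Schematic reward-guided tree search for inference-time alignment.
# `expand` and `reward` are hypothetical stand-ins for an LLM continuation
# sampler and a reward model; beam width and depth are illustrative.

def reward_guided_search(prompt, expand, reward, beam_width=2, depth=3):
    """Keep the `beam_width` best partial responses per step, by reward."""
    beam = [prompt]
    for _ in range(depth):
        children = [c for node in beam for c in expand(node)]
        if not children:
            break
        children.sort(key=reward, reverse=True)
        beam = children[:beam_width]
    return beam[0]  # highest-reward response found

# Toy instantiation: grow binary strings, reward = number of '1' bits.
demo = reward_guided_search(
    "", expand=lambda s: [s + "0", s + "1"], reward=lambda s: s.count("1"))
```

The toy search greedily keeps the highest-reward prefixes, ending at "111"; with a real reward model the same loop steers generation toward aligned responses without any preference tuning.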
The ACM Multimedia 2025 Grand Challenge of Multimodal Conversational Aspect-based Sentiment Analysis
Understanding fine-grained sentiment dynamics in human conversations is central for next-generation
AI, especially when interactions are rich in modalities and context. To advance research, we
organized the MCABSA challenge and introduce two subtasks: Panoptic Sentiment Sextuple Extraction
and Sentiment Flipping Analysis. To support these tasks, we present the PanoSent dataset, a
high-quality benchmark annotated across text, image, audio, and video modalities, and summarize top
systems and findings.
The jumping reasoning curve? Tracking the evolution of reasoning performance in GPT-[n] and o-[n] models on multimodal puzzles
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm
shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have
demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for
Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns,
whereas humans often perceive and reason about multimodal scenarios involving both vision and
language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in
multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models
(including o1, o3, and o4-mini) on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA,
which demand fine-grained visual perception. Our results reveal that the o-[n] series, particularly
later iterations like o3 and o4-mini, significantly outperform the GPT-[n] series and show strong
scalability in multimodal reasoning. Nonetheless, despite these substantial advancements and the
superior capabilities demonstrated by the o-[n] series, our findings highlight that even these
leading models face persistent challenges. Difficulties are particularly evident in tasks requiring
precise visual perception, robust compositional reasoning across multiple visual attributes, and
solving complex algorithmic or highly combinatorial puzzles, indicating critical areas for future
AGI development. We plan to continuously track new models in the series and update our results in
this paper accordingly. All resources used in this evaluation are openly available at
https://github.com/declare-lab/LLM-PuzzleTest.
Toward robust multimodal sentiment analysis using multimodal foundational models
This paper investigates robust approaches to multimodal sentiment analysis using multimodal
foundational models. We analyze model robustness under modality corruption, distributional shifts,
and noisy real-world inputs, proposing training and adaptation strategies that substantially improve
performance across benchmarks.
Training vision-language process reward models for test-time scaling in multimodal reasoning: Key insights and lessons learned
Process Reward Models (PRMs) provide step-level supervision that improves the reliability of
reasoning in large language models. While PRMs have been extensively studied in text-based domains,
their extension to Vision Language Models (VLMs) remains limited. Existing Vision-Language PRMs
(VL-PRMs) rely on Monte Carlo Tree Search (MCTS) for data construction, which can often produce
noisy supervision signals and limit generalization across tasks. In this work, we aim to elucidate
the design space of VL-PRMs by exploring diverse strategies for dataset construction, training, and
test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with
judgments from a strong VLM, producing more accurate step-level labels. Second, we propose
perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding
stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing
that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments covering five
diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision) reveal
several key insights: (i) VL-PRMs, when used as Outcome Reward Models (ORMs) during test-time
scaling (TTS), can outperform VL-PRM-guided process step selection, (ii) smaller VL-PRMs can match
or even surpass larger ones in detecting process errors, (iii) VL-PRMs uncover latent reasoning
abilities in stronger VLM backbones, (iv) perception-level supervision leads to significant gains in
test-time scaling, and (v) TTS performance of different policies improves on advanced math reasoning
datasets
despite not training VL-PRMs on such datasets. We hope our work will motivate further research and
support the advancement of VLMs.
Why AI is weird and should not be this way: Towards AI for everyone, with everyone, by everyone
★
This paper presents a vision for creating AI systems that are inclusive at every stage of
development, from data collection to model design and evaluation. We address key limitations in the
current AI pipeline and its WEIRD representation, such as lack of data diversity, biases in model
performance, and narrow evaluation metrics. We also focus on the need for diverse representation
among the developers of these systems, as well as incentives that are not skewed toward certain
groups. We highlight opportunities to develop AI systems that are for everyone (with diverse
stakeholders in mind), with everyone (inclusive of diverse data and annotators), and by everyone
(designed and developed by a globally diverse workforce).
2024
Can-do! A dataset and neuro-symbolic grounded framework for embodied planning with large multimodal models
Large multimodal models have demonstrated impressive problem-solving abilities in vision
and language tasks, and have the potential to encode extensive world knowledge. However, it remains
an open challenge for these models to perceive, reason, plan, and act in realistic environments. In
this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities
through more diverse and complex scenarios than previous datasets. Our dataset includes 400
multimodal samples, each consisting of natural language user instructions, visual images depicting
the environment, state changes, and corresponding action plans. The data encompasses diverse aspects
of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis
reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception,
comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a
neurosymbolic framework that first grounds the plan generation in the perceived environment states
and then leverages symbolic planning engines to augment the model-generated plans. Experimental
results demonstrate the effectiveness of our framework compared to strong baselines. Our code and
dataset are available at this https URL.
Chain-of-Knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources
★
We present chain-of-knowledge (CoK), a novel framework that augments large language models (LLMs)
by dynamically incorporating grounding information from heterogeneous sources. It results in more
factual rationales and reduced hallucination in generation. Specifically, CoK consists of three
stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a
knowledge-intensive question, CoK first prepares several preliminary rationales and answers while
identifying the relevant knowledge domains. If there is no majority consensus among the answers from
samples, CoK corrects the rationales step by step by adapting knowledge from the identified
domains. These corrected rationales can plausibly serve as a better foundation for the final answer
consolidation. Unlike prior studies that primarily use unstructured data, CoK also leverages
structured knowledge sources such as Wikidata and tables that provide more reliable factual
information. To access both unstructured and structured knowledge sources in the dynamic knowledge
adapting stage, we propose an adaptive query generator that allows the generation of queries for
various types of query languages, including SPARQL, SQL, and natural sentences. Moreover, to
minimize error propagation between rationales, CoK corrects the rationales progressively, using
preceding corrected rationales to generate and correct subsequent rationales. Extensive experiments
show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across
different domains.
CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models
Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and
audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for
achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in
Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up
inference by approximating denoising distributions, but this introduces issues with model
convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture
grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models,
CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or
pre-trained model dependencies. We further design weighted samplers to incorporate different
sampling positions into model training with dynamic probabilities, ensuring unbiased learning
throughout the entire training process. We present a real-time mel-spectrogram generation
consistency model, validated through comprehensive evaluations. Experimental results underscore
CM-TTS’s superiority over existing single-step speech synthesis systems, representing a significant
advancement in the field.
Consistency Guided Knowledge Retrieval and Denoising in LLMs for Zero-shot Document-level Relation Triplet Extraction
★
Document-level Relation Triplet Extraction (DocRTE) is a fundamental task in information systems
that aims to simultaneously extract entities with semantic relations from a document. Existing
methods heavily rely on a substantial amount of fully labeled data. However, collecting and
annotating data for newly emerging relations is time-consuming and labor-intensive. Recent advanced
Large Language Models (LLMs), such as ChatGPT and LLaMA, exhibit impressive long-text generation
capabilities, inspiring us to explore an alternative approach for obtaining auto-labeled documents
with new relations. In this paper, we propose a Zero-shot Document-level Relation Triplet Extraction
(ZeroDocRTE) framework, which Generates labeled data by Retrieval and Denoising Knowledge from LLMs,
called GenRDK. Specifically, we propose a chain-of-retrieval prompt to guide ChatGPT to generate
labeled long-text data step by step. To improve the quality of synthetic data, we propose a
denoising strategy based on the consistency of cross-document knowledge. Leveraging our denoised
synthetic data, we proceed to fine-tune the LLaMA2-13B-Chat for extracting document-level relation
triplets. We perform experiments for both zero-shot document-level relation and triplet extraction
on two public datasets. The experimental results illustrate that our GenRDK framework outperforms
strong baselines.
Content extraction based on hop distance within a graph model
A method of categorizing text entries on a document can include determining, for each of a
plurality of text bounding boxes in the document, respective text, respective coordinates, and
respective input embeddings. The method may further include defining a graph of the plurality of
bounding boxes, the graph comprising a plurality of connections among the plurality of bounding
boxes, each connection comprising a first and second bounding box and zero or more respective
intermediate bounding boxes. The method may further include determining a respective attention value
for each connection according to a quantity of intermediate bounding boxes in the connection and,
based on the respective attention values and a transformer-based machine learning model applied to
the respective input embeddings and respective coordinates, determining output embeddings for each
bounding box and, based on the respective output embeddings, generating a bounding box label for
each bounding box.
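The hop-distance attention described above can be sketched as a simple function of the number of intermediate bounding boxes on a connection. The exponential decay form and the decay rate are assumptions for illustration; the patent specifies only that the attention value depends on the quantity of intermediate boxes.

```python
import math

# Illustrative hop-distance attention for bounding-box graph connections:
# connections with fewer intermediate boxes receive stronger attention.
# The exponential decay is an assumed form, not the patented formula.

def connection_attention(num_intermediate, decay=0.5):
    """Fewer intermediate boxes -> stronger attention between the pair."""
    return math.exp(-decay * num_intermediate)

direct = connection_attention(0)   # adjacent boxes, maximal attention
distant = connection_attention(4)  # four boxes in between, attenuated
```

These per-connection values would then bias the transformer's attention over the input embeddings and coordinates before the bounding-box labels are predicted.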
DELLA-Merging: Reducing interference in model merging through magnitude-based sampling
★
With the proliferation of domain-specific models, model merging has emerged as a set of
techniques that combine the capabilities of multiple models into one that can multitask without the
cost of additional training. In this paper, we propose a new model merging technique, Drop and
rEscaLe via sampLing with mAgnitude (DELLA-Merging), that employs a novel pruning technique,
MAGPRUNE, which shows significant advantages over DARE and TIES. MAGPRUNE first ranks the parameters
in order of their magnitude and assigns higher dropout probabilities (p) to parameters with lower
ranks corresponding to lower magnitudes. To approximate the original embeddings, MAGPRUNE employs a
rescaling operation on the parameters that survive the random dropping by 1/(1 - p). On three
different expert models considered for merging (LM, Math, Code) and corresponding benchmark datasets
(AlpacaEval, GSM8K, MBPP), DELLA shows an average improvement of 2.4 points over baseline methods
employing delta parameter pruning (an improvement of 3.6 points over TIES, 1.2 points over DARE),
and 11.1 points over the no-pruning baseline (TA). We release the source code at: this https URL.
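The MAGPRUNE step, ranking delta parameters by magnitude, dropping low-magnitude ones with higher probability, and rescaling survivors by 1/(1 - p), can be sketched on a flat vector. The linear probability schedule between p_min and p_max is an illustrative assumption; only the rank-based dropout and the 1/(1 - p) rescaling follow the abstract.

```python
import numpy as np

# Sketch of MAGPRUNE on a vector of delta parameters: rank by magnitude,
# assign lower-magnitude parameters a higher drop probability p, and
# rescale the survivors by 1/(1 - p). The linear p schedule is an
# illustrative assumption.

def magprune(delta, p_min=0.1, p_max=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # rank 0 = largest magnitude, so it gets the smallest drop probability
    ranks = np.argsort(np.argsort(-np.abs(delta)))
    p = p_min + (p_max - p_min) * ranks / max(len(delta) - 1, 1)
    keep = rng.random(len(delta)) >= p
    out = np.zeros_like(delta, dtype=float)
    out[keep] = delta[keep] / (1.0 - p[keep])  # rescale survivors
    return out
```

The rescaling keeps the expected value of each parameter unchanged under the random dropping, which is what lets the pruned deltas approximate the original fine-tuned weights during merging.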
Domain-expanded ASTE: Rethinking generalization in aspect sentiment triplet extraction
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to
provide fine-grained insights into human sentiments. However, existing benchmarks are limited to two
domains and do not evaluate model performance on unseen domains, raising concerns about the
generalization of proposed methods. Furthermore, it remains unclear if large language models (LLMs)
can effectively handle complex sentiment tasks like ASTE. In this work, we address the issue of
generalization in ASTE from both a benchmarking and modeling perspective. We introduce a
domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models
in both in-domain and out-of-domain settings. Additionally, we propose CASE, a simple and effective
decoding strategy that enhances trustworthiness and performance of LLMs in ASTE. Through
comprehensive experiments involving multiple tasks, settings, and models, we demonstrate that CASE
can serve as a general decoding strategy for complex sentiment tasks. By expanding the scope of
evaluation and providing a more reliable decoding strategy, we aim to inspire the research community
to reevaluate the generalizability of benchmarks and models for ASTE. Our code, data, and models are
available at https://github.com/DAMO-NLP-SG/domain-expanded-aste.
Hate speech detection: A comprehensive review of recent works
★
This review surveys recent advances in hate speech detection, covering datasets, annotation
practices, model architectures, and evaluation metrics. We discuss challenges including implicit
bias, multilinguality, and context-dependence, and outline directions for robust, fair, and
explainable hate speech detection systems.
HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks
Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain
to the speech domain. While developing TTS architectures that
Improving text-to-audio models with synthetic captions
It is an open challenge to obtain high-quality training data, especially captions, for
text-to-audio models. Although prior methods have leveraged text-only language models to augment and
improve captions, such methods have limitations related to scale and coherence between audio and
captions. In this work, we propose an audio captioning pipeline that uses an audio language model to
synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a
dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of
pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on
AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to
significant improvements in audio generation quality, achieving a new state of the art.
INSTRAUG: Automatic Instruction Augmentation for Multimodal Instruction Fine-tuning
Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven
to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent
works on high-quality instruction-following data generation and selection require substantial human
labor to conceive model-understandable instructions for the given tasks and to carefully filter the
LLM-generated data. In this work, we introduce INSTRAUG, an automatic instruction augmentation
method for multimodal tasks. It starts from a handful of basic, straightforward meta instructions
but can expand an instruction-following dataset 30-fold. Results on two popular multimodal
instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG significantly
improves the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks,
matching the benefits of scaling up the training data several times over.
InstructEval: Towards holistic evaluation of instruction-tuned large language models
★
Instruction-tuned large language models have revolutionized natural language processing and have
shown great potential in applications such as conversational agents. These models, such as GPT-4,
can not only master language but also solve complex tasks in areas like mathematics, coding,
medicine, and law. However, there is still a lack of comprehensive understanding regarding their
full potential, primarily due to the black-box nature of many models and lack of holistic
evaluation. To address these challenges, we present InstructEval, a more comprehensive evaluation
suite designed specifically for instruction-tuned large language models. Unlike previous works, our
evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and
alignment to human values. We take a holistic approach to analyze various factors affecting model
performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is a crucial factor in scaling model
performance. While open-source models demonstrate impressive writing abilities, there is substantial
room for improvement in problem-solving and alignment.
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
★
We propose RESTA to perform LLM realignment towards safety, which gets compromised due to
downstream task fine-tuning. RESTA stands for REstoring Safety through Task Arithmetic. At its core,
it involves a simple arithmetic addition of a safety vector to the weights of the compromised model.
We demonstrate the effectiveness of RESTA in both parameter-efficient and full fine-tuning, covering
a wide range of downstream tasks, including instruction following in Chinese, English, and Hindi, as
well as problem-solving capabilities in Code and Math. We also showcase the generalizability of
RESTA on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed
as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5
sub-categories of harm. Overall, RESTA decreases the harmfulness of the compromised model from 18.6%
to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while
maintaining most of the model’s performance on the task. We release the source codes at:
https://github.com/declare-lab/resta.
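The arithmetic at the heart of RESTA is a per-layer weight addition. A minimal sketch, assuming the safety vector has already been computed as a per-layer weight delta, with `scale` as a hypothetical interpolation knob not taken from the paper:

```python
import numpy as np

def restore_safety(compromised, safety_vector, scale=1.0):
    """RESTA's core step as a sketch: add a precomputed safety vector
    (a per-layer weight delta) to the weights of the fine-tuned,
    safety-compromised model, layer by layer."""
    return {name: w + scale * safety_vector[name]
            for name, w in compromised.items()}
```

Since the operation is a simple addition in weight space, it applies equally to fully fine-tuned weights and to parameter-efficient deltas.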
Large language models for automated open-domain scientific hypotheses discovery
★
Hypothetical induction is recognized as the main reasoning type when scientists make observations
about the world and try to propose hypotheses to explain those observations. Past research on
hypothetical induction is under a constrained setting: (1) the observation annotations in the
dataset are carefully manually handpicked sentences (resulting in a closed-domain setting); and (2)
the ground truth hypotheses are mostly commonsense knowledge, making the task less challenging. In
this work, we tackle these problems by proposing the first dataset for social science academic
hypotheses discovery, with the final goal to create systems that automatically generate valid,
novel, and helpful scientific hypotheses, given only a pile of raw web corpus. Unlike previous
settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and
(2) proposing hypotheses even new to humanity. A multi-module framework is developed for the task,
including three different feedback mechanisms to boost performance, which exhibits superior
performance in terms of both GPT-4-based and expert-based evaluation. To the best of our knowledge,
this is the first work showing that LLMs are able to generate novel ("not existing in the
literature") and valid ("reflecting reality") scientific hypotheses.
Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation
Different languages have distinct phonetic systems and vary in their prosodic features, making it
challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in
multilingual settings. Furthermore, a TTS architecture needs to be both expressive enough to capture
nuances in multiple languages and efficient enough to be practical for deployment. The standard
approach is to build a transformer-based model such as SpeechT5 and train it on a large multilingual
dataset. As these models grow in size, conventional fine-tuning for adapting them becomes
impractical due to heavy computational cost. In this paper, we propose to integrate
parameter-efficient transfer learning (PETL) methods such as adapters and hypernetworks with the TTS
architecture for multilingual speech synthesis. Notably, in our experiments PETL methods are able to
achieve comparable or even better performance than full fine-tuning with only ~2.5% tunable
parameters. Code and samples are available at: this https URL.
Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents
Documents that consist of diverse templates and exhibit complex spatial structures pose a challenge
for document entity classification. We propose KNN-Former, which incorporates a new kind of spatial
bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We
limit entities’ attention only to their local radius defined by the KNN graph. We also use
combinatorial matching to address the one-to-one mapping property that exists in many documents,
where one field has only one corresponding entity. Moreover, our method is highly
parameter-efficient compared to existing approaches in terms of the number of trainable parameters.
Despite this, experiments across various datasets show our method outperforms baselines in most
entity types. Many real-world documents exhibit combinatorial properties which can be leveraged as
inductive biases to improve extraction accuracy, but existing datasets do not cover these documents.
To facilitate future research into these types of documents, we release a new ID document dataset
that covers diverse templates and languages. We also release enhanced annotations for an existing
dataset.
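The locality bias can be pictured with a toy construction; the 2-D entity coordinates, the choice of `k`, and the boolean-mask formulation below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def knn_attention_mask(coords, k=2):
    """KNN-Former-style locality bias as a sketch: each document entity may
    attend only to itself and its k nearest neighbours in layout space.
    Returns a boolean mask usable as an additive -inf bias in attention."""
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self from neighbour ranking
    nbrs = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest entities
    mask = np.eye(n, dtype=bool)          # always allow self-attention
    rows = np.repeat(np.arange(n), k)
    mask[rows, nbrs.ravel()] = True
    return mask
```

Restricting attention to a local radius like this is what keeps the method parameter-efficient while still injecting spatial structure into the transformer.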
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and
advancement, there are still gaps in defining a more holistic research target seamlessly integrating
multimodality, conversation context, fine-granularity, and also covering the changing sentiment
dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a
multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment
Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale
from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the
dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark
the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring
high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both
implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel
Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model
(namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate
the superiority of our methods over strong baselines, validating the efficacy of all our proposed
methods. The work is expected to open up a new era for the ABSA community, and thus all our codes
and data are open at https://PanoSent.github.io/.
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
★
Large multimodal models extend the impressive capabilities of large language models by integrating
multimodal understanding abilities. However, it is not clear how they can emulate the general
intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are
key to general intelligence, we introduce PuzzleVQA, a collection of 2000 puzzle instances based on
abstract patterns. With this dataset, we evaluate large multimodal models with abstract patterns
based on fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments
on state-of-the-art large multimodal models, we find that they are not able to generalize well to
simple abstract patterns. Notably, GPT-4V achieves a score of 46.4% on single-concept puzzles, which
shows that state-of-the-art models struggle on our dataset. To diagnose the reasoning challenges in
large multimodal models, we progressively guide the models with our ground truth reasoning
explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic
analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive
reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal
models and how they can better emulate human cognitive processes in the future.
Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths
Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through
step-by-step reasoning. However, they may still falter on more complex problems, making errors that
disrupt their reasoning paths. We attribute this to the expansive solution space, where each step
has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a
specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to
reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning
step while penalizing unfavorable ones, enhancing the model’s overall problem-solving performance.
Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from
closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks,
such as math word problems and science-based exam questions. The experiments demonstrate that our
framework significantly enhances the reasoning performance of large language models, with up to 3.1%
and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at
https://reasoning-paths.github.io.
Reward Steering with Evolutionary Heuristics for Decoding-time Alignment
Inference-time computation methods enhance the performance of Large Language Models (LLMs) by
leveraging additional computational resources to achieve superior results. Common techniques, such
as Best-of-N sampling, Majority Voting, and variants of tree-search algorithms, have proven
effective in boosting the performance of LLMs. These approaches strategically trade increased
computational resources for improved model responses. In this work, we propose DARWIN, an
inference-time alignment method that leverages the guidance of a reward model to achieve alignment
through a reward-guided tree search. Empirical evidence indicates that our method outperforms other
inference-time alignment methods such as Best-of-N and ARGS on two widely accepted alignment
benchmarks, AlpacaEval 2 and MT-Bench. Furthermore, we show that our inference-time approach
achieves performance comparable to preference-tuned models on both benchmarks, highlighting the
effectiveness of trading inference-time compute for enhanced performance. We have released our
code at: this https URL.
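For contrast with DARWIN's tree search, the Best-of-N baseline it is compared against fits in a few lines; `generate` and `reward` are stand-ins for an LLM sampler and a reward model, not actual APIs:

```python
import random

def best_of_n(generate, reward, prompt, n=8, seed=0):
    """Best-of-N sampling as a sketch: draw n candidate responses and
    return the one the reward model scores highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)
```

DARWIN replaces this flat, one-shot sampling with a reward-guided tree search that reinvests inference-time compute along promising branches instead of spreading it uniformly.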
Ruby teaming: Improving quality diversity search with memory for automated red teaming
We propose Ruby Teaming, a method that improves on Rainbow Teaming by including a memory
cache as its third dimension. The memory dimension provides cues to the mutator to yield
better-quality prompts, both in terms of attack success rate (ASR) and quality diversity. The prompt
archive generated by Ruby Teaming has an ASR of 74%, which is 20% higher than the baseline. In terms
of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's
Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
Ensuring the safe alignment of large language models (LLMs) with human values is critical as they
become integral to applications like translation and question answering. Current alignment methods
struggle with dynamic user intentions and complex objectives, making models vulnerable to generating
harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across
different scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety
Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote
safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that
could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic
significantly improves safety measures, reduces over-safety, and maintains model utility,
outperforming existing methods in ensuring safe content generation.
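One common way to realize a "Harm Direction Removal" step is directional ablation of activations; this sketch assumes that formulation, which may differ from the paper's exact procedure, and `v` is a hypothetical precomputed harm direction:

```python
import numpy as np

def remove_direction(h, v):
    """Project the hidden state h onto a harm direction v and subtract
    that component, leaving h orthogonal to v."""
    v = v / np.linalg.norm(v)
    return h - (h @ v) * v
```

Because the result has no component along v, any behaviour mediated by that direction is suppressed without retraining, which is what makes the framework training-free.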
Self-adaptive sampling for accurate video question answering on image text models
Image–text models (ITMs) are the prevalent architecture for solving video question-answering tasks,
as they require only a few input frames and thus save huge computational cost compared to
video-language models. However, we find that existing ITM video question-answering solutions either
1) adopt simplistic and unintentional sampling strategies, which may miss the key frames that offer
answer clues; or 2) sample a large number of frames into divided groups, which the available
computational resources cannot accommodate. In this work, we aim at an efficient sampling method for
the few-frame setting. We first summarize a family of prior sampling methods based on
question-frame correlation into a unified one, dubbed Most Implied Frames (MIF). Through preliminary
results and analysis, we form the hypothesis that question-aware sampling is not necessary, from
which we further propose a second method, Most Dominant Frames (MDF). Experimental results on four
public datasets and three advanced ITMs demonstrate that our proposed strategies boost the
performance of image-text pretrained models and apply broadly across model architectures and dataset
types. Our code is available at https://github.com/declare-lab/Sealing.
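The question-aware family unified as MIF can be sketched as a cosine-similarity top-k selection; the frame/question embeddings and the value of `k` here are illustrative assumptions, not the paper's exact scoring:

```python
import numpy as np

def most_implied_frames(frame_feats, question_feat, k=4):
    """Score each frame by cosine similarity to the question embedding
    and keep the top-k frames, restored to temporal order."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    scores = f @ q
    return np.sort(np.argsort(scores)[-k:])  # top-k, in temporal order
```

A question-agnostic alternative in the spirit of MDF would score frames by intrinsic dominance (e.g. similarity to neighbouring frames) rather than by the question, avoiding a per-question pass over all frames.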
Sowing the wind, reaping the whirlwind: The impact of editing language models
In the rapidly advancing field of artificial intelligence, the concept of ‘Red-Teaming’ or
‘Jailbreaking’ large language models (LLMs) has emerged as a crucial area of study. This approach is
especially significant in terms of assessing and enhancing the safety and robustness of these
models. This paper investigates the intricate consequences of such modifications through model
editing, uncovering a complex relationship between enhancing model accuracy and preserving its
ethical integrity. Our in-depth analysis reveals a striking paradox: while injecting accurate
information is crucial for model reliability, it can paradoxically destabilize the model’s
foundational framework, resulting in unpredictable and potentially unsafe behaviors. Additionally,
we propose a benchmark dataset NicheHazardQA to investigate this unsafe behavior both within the
same and cross-topical domains. This aspect of our research sheds light on how the edits impact the
model’s safety metrics and guardrails. Our findings show that model editing serves as a
cost-effective tool for topical red-teaming by methodically applying targeted edits and evaluating
the resultant model behavior.
Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization
★
Generative multimodal content is increasingly prevalent in much of the content creation arena, as
it has the potential to allow artists and media personnel to create pre-production mockups by
quickly bringing their ideas to life. The generation of audio from text prompts is an important
aspect of such processes in the music and film industry. Many of the recent diffusion-based
text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of
datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or
events and their temporal ordering in the output audio with respect to the input prompt. Our
hypothesis is that focusing on these aspects of audio generation could improve audio generation
performance in the presence of limited data. As such, in this work, using an existing text-to-audio
model Tango, we synthetically create a preference dataset where each prompt has a winner audio
output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in
theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the
publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization)
loss on our preference dataset and show that it leads to improved audio output over Tango and
AudioLDM2, in terms of both automatic- and manual-evaluation metrics.
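The preference objective behind diffusion-DPO derives from standard DPO; a scalar sketch of that base loss follows (Tango 2 uses a diffusion-specific variant over noise predictions, which this simplification omits):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on one (winner, loser) preference pair:
    push the model to raise the winner's likelihood relative to the
    reference model more than the loser's."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the model and reference agree, the margin is zero and the loss is log 2; increasing the winner's relative log-likelihood drives the loss down, which is the gradient signal the synthetic winner/loser pairs provide.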
Towards robust instruction tuning on multimodal large language models
Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven to
be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent
works on high-quality instruction-following data generation and selection require substantial human
labor to conceive model-understandable instructions for the given tasks and to carefully filter the
LLM-generated data. In this work, we introduce INSTRAUG, an automatic instruction augmentation
method for multimodal tasks. It starts from a handful of basic, straightforward meta instructions
but can expand an instruction-following dataset 30-fold. Results on two popular multimodal
instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG significantly
improves the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks,
matching the benefits of scaling up the training data several times over.
Two are better than one: Context window extension with multi-grained self-injection
The limited context window of contemporary large language models (LLMs) remains a huge barrier to
their broader application. We propose SharedLLM, a novel approach based on multi-grained context
compression and query-aware information retrieval. SharedLLM composes two stacked short-context
LLMs: a lower model acting as a compressor and an upper model as a decoder. The lower model
compresses long inputs into compact, multi-grained representations forwarded to the upper model,
enabling efficient context-aware processing. A specialized tree-style data structure and retrieval
algorithm enable rapid encoding and lookup of multi-grained contextual information. This approach
reduces memory footprint and yields inference speedups while generalizing to very long inputs.
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense
★
Large language models (LLMs) have demonstrated substantial commonsense understanding through
numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely
unexamined. In this paper, we conduct a comprehensive examination of the capabilities and
limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using
several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant
discrepancy in performance when tested on culture-specific commonsense knowledge for different
cultures; (2) LLMs’ general commonsense capability is affected by cultural context; and (3) The
language used to query the LLMs can impact their performance on culture-related tasks. Our study
points to the inherent bias in the cultural understanding of LLMs and provides insights that can
help develop culturally-aware language models.
Understanding, Leveraging, and Improving Large Language Models
The emergence of Large Language Models (LLMs) has marked a substantial advancement in Natural
Language Processing (NLP), contributing significantly to enhanced task performance both within and
outside specific domains. However, amidst these achievements, three key questions remain unanswered:
1) The mechanism through which LLMs accomplish their tasks and their limitations, 2) Effectively
harnessing the power of LLMs across diverse domains, and 3) Strategies for enhancing the performance
of LLMs. This talk aims to delve into our research group's endeavors to address these pivotal
questions. Firstly, I will outline our approach, which involves utilizing ontology-guided prompt
perturbations to unravel the primary limitations of LLMs in solving mathematical problems. Moving on
to the second question, we will explore the utilization of synthetic data generated by LLMs to
bolster challenging downstream tasks, particularly focusing on structured prediction where LLMs face
persistent challenges. I will elaborate on our initiatives aimed at improving LLMs by incorporating
highly effective retrieval strategies, specifically addressing the prevalent challenge of
hallucinations that often plagues contemporary LLMs. Finally, I will present a technique on LLM
realignment to restore safety lost during fine-tuning.
Video2music: Suitable music generation from videos using an affective multimodal transformer model
★
We develop Video2Music, a generative music framework that matches provided videos by extracting
semantic, scene, motion, and emotion features from curated music videos. Audio is transcribed into
MIDI/chords and features like note density and loudness are extracted; an Affective Multimodal
Transformer (AMT) is trained on the MuVi-Sync dataset to generate music conditioned on video
features with an affective-similarity mechanism. A post-processing biGRU-based regressor estimates
note density and loudness to render dynamic output. User studies confirm that generated music
matches video emotion and quality.
Walledeval: A comprehensive safety evaluation toolkit for large language models
WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models
(LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones,
and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated
safety, and prompt injections. The framework supports both LLM and judge benchmarking, and
incorporates custom mutators to test safety against various text-style mutations such as future
tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small and performant
content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural
contexts. We make WalledEval publicly available at https://github.com/walledai/walledeval with a
demonstration video at https://youtu.be/50Zy97kj1MA.
2023
A Comprehensive Survey of Sentence Representations: From the BERT Epoch to the ChatGPT Era and Beyond
Sentence representations are a critical component in NLP applications such as retrieval, question
answering, and text classification. They capture the meaning of a sentence, enabling machines to
understand and reason over human language. In recent years, significant progress has been made in
developing methods for learning sentence representations, including unsupervised, supervised, and
transfer learning approaches. However, to date there has been no literature review on sentence
representations. In this paper, we provide an overview of the different methods for sentence representation
learning, focusing mostly on deep learning models. We provide a systematic organization of the
literature, highlighting the key contributions and challenges in this area. Overall, our review
highlights the importance of this area in natural language processing, the progress made in sentence
representation learning, and the challenges that remain. We conclude with directions for future
research, suggesting potential avenues for improving the quality and efficiency of sentence
representations.
A Review of Deep Learning Techniques for Speech Processing
★
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable of
extracting intricate features from speech data. This development has paved the way for unparalleled
advancements in automatic speech recognition, text-to-speech synthesis, and emotion recognition,
propelling the performance of these tasks to unprecedented heights. The power
of deep learning techniques has opened up new avenues for research and innovation in the field of
speech processing, with far-reaching implications for a range of industries and applications. This
review paper provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of speech processing
research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning
architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the
approaches and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in
the literature and describe how different deep-learning networks have been utilized to tackle these
tasks. Additionally, we discuss the challenges and future directions of deep learning in speech
processing, including the need for more parameter-efficient, interpretable models and the potential
of deep learning for multimodal speech processing. By examining the field's evolution, comparing and
contrasting different approaches, and highlighting future directions and challenges, we hope to
inspire further research in this exciting and rapidly advancing field.
A Robust Information-Masking Approach for Domain Counterfactual Generation
Domain shift is a big challenge in NLP. Many approaches, thus, resort to learning domain-invariant
features to mitigate the hurdles of domain shift during inference. Such methods, however, inexorably
fail to leverage the domain-specific nuances relevant to the task at hand. To avoid such drawbacks,
domain counterfactual generation has recently been proposed that aims to transform a text from the
source domain to a given target domain. To achieve this, the existing method uses a frequency-based
approach to identify and mask the source-domain-specific tokens in a text. A pretrained LM is then
prompted to fill the masks with target-domain-specific tokens. We, however, have observed that, due
to limitations of the available data, such a frequency-based method may either miss some
domain-token associations or lead to some spurious domain-token associations. To this end, we
additionally employ attention norm-based scores to identify additional token-domain associations
from a domain classifier. To minimize spurious associations, we also devise an iterative unmasking
heuristic that unmasks the masked tokens to minimize the confidence of a domain classifier in the
source domain. Our experiments empirically show that the counterfactual samples sourced from our
masked text lead to improved domain transfer across various classification tasks. The proposed
approach outperforms the baselines on 10 out of 12 domain-counterfactual classification settings
with an average of 1.7% improvement in accuracy metric.
Adapter Pruning using Tropical Characterization
Adapters are widely popular parameter-efficient transfer learning approaches in natural language
processing that insert trainable modules in between layers of a pre-trained language model. Apart
from several heuristics, however, there has been a lack of studies analyzing the optimal number of
adapter parameters needed for downstream applications. Thus, we propose an adapter pruning approach
by studying the tropical characteristics of trainable modules. We cast it as an optimization problem
that aims to prune parameters from the adapter layers without changing the orientation of underlying
tropical hypersurfaces. Our experiments on five NLP datasets show that tropical geometry tends to
identify more relevant parameters to prune when compared with the magnitude-based baseline, while a
combined approach works best across the tasks.
ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation
There are significant challenges for speaker adaptation in text-to-speech for languages that are
not widely spoken or for speakers with accents or dialects that are not well-represented in the
training data. To address this issue, we propose the use of the "mixture of adapters" method. This
approach involves adding multiple adapters within a backbone-model layer to learn the unique
characteristics of different speakers. Our approach outperforms the baseline, with a noticeable
improvement of 5% observed in speaker preference tests when using only one minute of data for each
new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters (11%
of the total model parameters). This is a significant achievement in parameter-efficient speaker
adaptation, and one of the first models of its kind. Overall, our proposed approach offers a
promising solution for speech synthesis, particularly for adapting to speakers from
diverse backgrounds.
Contrastive chain-of-thought prompting
★
Despite the success of chain of thought in enhancing language model reasoning, the
underlying process remains less well understood. Although logically sound reasoning appears
inherently crucial for chain of thought, prior studies surprisingly reveal minimal impact when using
invalid demonstrations instead. Furthermore, the conventional chain of thought does not inform
language models on what mistakes to avoid, which potentially leads to more errors. Hence, inspired
by how humans can learn from both positive and negative examples, we propose contrastive chain of
thought to enhance language model reasoning. Compared to the conventional chain of thought, our
approach provides both valid and invalid reasoning demonstrations, to guide the model to reason
step-by-step while reducing reasoning mistakes. To improve generalization, we introduce an automatic
method to construct contrastive demonstrations. Our experiments on reasoning benchmarks demonstrate
that contrastive chain of thought can serve as a general enhancement of chain-of-thought prompting.
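The core idea, pairing a valid demonstration with an invalid one in the same prompt, can be sketched as follows. The prompt template and the negative rationale (built here by perturbing the bridging numbers of the valid one) are illustrative assumptions, not the paper's exact format:

```python
def contrastive_cot_prompt(question, valid_demo, invalid_rationale):
    """Assemble a contrastive chain-of-thought prompt: one correct and one
    incorrect worked example, followed by the new question."""
    return (
        "Question: " + valid_demo["question"] + "\n"
        "Correct reasoning: " + valid_demo["rationale"] + "\n"
        "Incorrect reasoning (avoid mistakes like this): " + invalid_rationale + "\n"
        "Answer: " + valid_demo["answer"] + "\n\n"
        "Question: " + question + "\n"
        "Answer:"
    )

valid = {
    "question": "If a pen costs $2, how much do 3 pens cost?",
    "rationale": "Each pen costs $2, so 3 pens cost 3 * 2 = $6.",
    "answer": "$6",
}
# An invalid rationale can be constructed automatically, e.g. by swapping
# the bridging numbers of the valid rationale (one possible strategy).
invalid = "Each pen costs $3, so 2 pens cost 2 * 3 = $5."

prompt = contrastive_cot_prompt(
    "If an apple costs $4, how much do 5 apples cost?", valid, invalid
)
```

The model then continues from the trailing "Answer:", guided by both what to do and what to avoid.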
Dialogue relation extraction with document-level heterogeneous graph attention networks
★
We propose a heterogeneous graph attention network to address the problem of dialogue relation
extraction. Compared with several popular sequence-based and graph-based models, our method shows
superior performance on the benchmark dataset DialogRE. The implementation of this work can be found
at https://github.com/declare-lab/dialog-HGAT. Dialogue relation extraction aims to detect the
relation between pairs of entities mentioned in a multi-party dialogue. It plays an essential role
in understanding the deep logic of dialogues and facilitating the development of intelligent
dialogue systems. We introduce a heterogeneous graph attention network to model the cross-sentence
relations in a conversation. This heterogeneous graph attention network has modeled multi-type
features of the conversation, such as utterance, word, speaker, argument, and entity type
information. We compare our method with several popular baselines such as convolutional neural
networks and long short-term memory, experimental results show our model outperforms the
state-of-the-art method by 9.4%/7.8% F 1 scores, and 6.6%/3.9% $$F1_c$$ F 1 c scores in both
validation and test sets with only 4.0M parameters. In this work, we present an attention-based
heterogeneous graph network to deal with the dialogue relation extraction task in an inductive
manner. Experimental results on the dataset DialogRE confirm the effectiveness of our method.
Evaluating parameter-efficient transfer learning approaches on SURE benchmark for speech understanding
Fine-tuning is widely used as the default algorithm for transfer learning from pre-trained models.
Parameter inefficiency can however arise when, during transfer learning, all the parameters of a
large pre-trained model need to be updated for individual downstream tasks. As the number of
parameters grows, fine-tuning is prone to overfitting and catastrophic forgetting. In addition, full
fine-tuning can become prohibitively expensive when the model is used for many tasks. To mitigate
this issue, parameter-efficient transfer learning algorithms, such as adapters and prefix tuning,
have been proposed as a way to introduce a few trainable parameters that can be plugged into large
pre-trained language models such as BERT and HuBERT. In this paper, we introduce the Speech
UndeRstanding Evaluation (SURE) benchmark for parameter-efficient learning for various speech
processing tasks. Additionally, we introduce a new adapter, ConvAdapter, based on 1D convolution. We
show that ConvAdapter outperforms the standard adapters while showing comparable performance against
prefix tuning and Low-Rank Adaptation with only 0.94% of trainable parameters.
Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt
We have witnessed the rapid proliferation of multimodal data on numerous social media platforms.
Conventional studies typically require massive labeled data to train models for Multimodal
Aspect-Based Sentiment Analysis (MABSA). However, collecting and annotating fine-grained multimodal
data for MABSA is difficult. To alleviate this issue, we perform three MABSA-related tasks with
only a small number of labeled multimodal samples. We first build diverse and comprehensive
multimodal few-shot datasets according to the data distribution. To capture the specific prompt for
each aspect term in a few-shot scenario, we propose a novel Generative Multimodal Prompt (GMP) model
for MABSA, which includes the Multimodal Encoder module and the N-Stream Decoders module. We further
introduce a subtask to predict the number of aspect terms in each instance to construct the
multimodal prompt. Extensive experiments on two datasets demonstrate that our approach outperforms
strong baselines on two MABSA-related tasks in the few-shot setting.
Few-shot multimodal sentiment analysis based on multimodal probabilistic fusion prompts
★
Multimodal sentiment analysis has gained significant attention due to the proliferation of
multimodal content on social media. However, existing studies in this area rely heavily on
large-scale supervised data, which is time-consuming and labor-intensive to collect. Thus, there is
a need to address the challenge of few-shot multimodal sentiment analysis. To tackle this problem,
we propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages
diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario.
Specifically, we start by introducing a Consistently Distributed Sampling approach called CDS, which
ensures that the few-shot dataset has the same category distribution as the full dataset. Unlike
previous approaches primarily using prompts based on the text modality, we design unified multimodal
prompts to reduce discrepancies between different modalities and dynamically incorporate multimodal
demonstrations into the context of each multimodal instance. To enhance the model's robustness, we
introduce a probabilistic fusion method to fuse output predictions from multiple diverse prompts for
each input. Our extensive experiments on six datasets demonstrate the effectiveness of our approach.
First, our method outperforms strong baselines in the multimodal few-shot setting. Furthermore,
under the same amount of data (1% of the full dataset), our CDS-based experimental results
significantly outperform those based on previously sampled datasets constructed from the same number
of instances of each class.
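The probabilistic fusion step can be sketched as a weighted average of the class distributions predicted under each prompt; the prompt names and numbers below are purely illustrative:

```python
def fuse_prompt_predictions(prob_dists, weights=None):
    """Probabilistically fuse the class distributions predicted under several
    diverse prompts into a single distribution (weighted average)."""
    n = len(prob_dists)
    if weights is None:
        weights = [1.0 / n] * n
    n_classes = len(prob_dists[0])
    fused = [sum(w * dist[c] for w, dist in zip(weights, prob_dists))
             for c in range(n_classes)]
    total = sum(fused)                      # renormalize for safety
    return [p / total for p in fused]

# Three hypothetical prompts, three sentiment classes (neg, neu, pos)
dists = [
    [0.1, 0.2, 0.7],   # text-style prompt
    [0.2, 0.2, 0.6],   # caption-style prompt
    [0.1, 0.3, 0.6],   # prompt with multimodal demonstrations
]
fused = fuse_prompt_predictions(dists)
pred = max(range(3), key=lambda c: fused[c])
```

Averaging over diverse prompts smooths out the idiosyncrasies of any single prompt, which is what makes the fused prediction more robust.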
Flacuna: Unleashing the problem solving power of vicuna using flan fine-tuning
Recently, the release of INSTRUCTEVAL has provided valuable insights into the performance
of large language models (LLMs) that utilize encoder-decoder or decoder-only architecture.
Interestingly, despite being introduced four years ago, T5-based LLMs, such as FLAN-T5, continue to
outperform the latest decoder-based LLMs, such as LLAMA and VICUNA, on tasks that require general
problem-solving skills. This performance discrepancy can be attributed to three key factors: (1)
Pre-training data, (2) Backbone architecture, and (3) Instruction dataset. In this technical report,
our main focus is on investigating the impact of the third factor by leveraging VICUNA, a large
language model based on LLAMA, which has undergone fine-tuning on ChatGPT conversations. To achieve
this objective, we fine-tuned VICUNA using a customized instruction dataset collection called
FLANMINI. This collection includes a subset of the large-scale instruction dataset known as FLAN, as
well as various code-related datasets and conversational datasets derived from ChatGPT/GPT-4. This
dataset comprises a large number of tasks that demand problem-solving skills. Our experimental
findings strongly indicate that the enhanced problem-solving abilities of our model, FLACUNA, are
obtained through fine-tuning VICUNA on the FLAN dataset, leading to significant improvements across
numerous benchmark datasets in INSTRUCTEVAL. FLACUNA is publicly available at this https URL.
kNN-CM: A Non-parametric Inference-Phase Adaptation of Parametric Text Classifiers
Semi-parametric models exhibit the properties of both parametric and non-parametric modeling and
have been shown to be effective in the next-word prediction language modeling task. However, there
is a lack of studies on the text-discriminating properties of such models. We propose an
inference-phase approach, the k-Nearest Neighbor Classification Model (kNN-CM), that enhances the
capacity of a pre-trained parametric text classifier by incorporating a simple neighborhood search
through the representation space of (memorized) training samples. The final class prediction of
kNN-CM is a convex combination of the probabilities obtained from the kNN search and the prediction
of the classifier. Our experiments show consistent performance improvements on eight SuperGLUE
tasks, three adversarial natural language inference (ANLI) datasets, 11 question-answering (QA)
datasets, and two sentiment classification datasets.
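The convex combination at the heart of kNN-CM can be sketched as follows; the toy representations and Euclidean distance are assumptions for illustration, standing in for the classifier's actual hidden representations:

```python
import math

def knn_cm_predict(query_repr, memory, classifier_probs, k=3, lam=0.5):
    """kNN-CM sketch: convex combination of a parametric classifier's class
    probabilities with probabilities from a kNN search over memorized
    training representations.

    memory : list of (representation, label) pairs
    lam    : interpolation weight on the kNN distribution
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbors = sorted(memory, key=lambda m: dist(query_repr, m[0]))[:k]
    n_classes = len(classifier_probs)
    knn_probs = [0.0] * n_classes
    for _, label in neighbors:
        knn_probs[label] += 1.0 / k         # uniform vote over neighbors
    return [lam * p_knn + (1 - lam) * p_clf
            for p_knn, p_clf in zip(knn_probs, classifier_probs)]

# Toy memory: two samples per class in a 2-D representation space.
memory = [([0.0, 0.0], 0), ([0.1, 0.1], 0), ([1.0, 1.0], 1), ([0.9, 1.1], 1)]
# The classifier alone favors class 0, but the neighborhood favors class 1.
probs = knn_cm_predict([0.95, 1.0], memory, classifier_probs=[0.6, 0.4])
```

In this toy case the neighborhood evidence flips the classifier's decision, which is exactly the correction the non-parametric component is meant to provide.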
Language guided visual question answering: Elevate your multimodal language model using knowledge-enriched prompts
Visual question answering (VQA) is the task of answering questions about an image. The task assumes
an understanding of both the image and the question to provide a natural language answer. VQA has
gained popularity in recent years due to its potential applications in a wide range of fields,
including robotics, education, and healthcare. In this paper, we focus on knowledge-augmented VQA,
where answering the question requires commonsense knowledge, world knowledge, and reasoning about
ideas and concepts not present in the image. We propose a multimodal framework that uses language
guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more
accurately. We benchmark our method on the multi-choice question-answering task of the A-OKVQA,
Science-QA, VSR, and IconQA datasets using CLIP and BLIP models. We show that the use of language
guidance is a simple but powerful and effective strategy for visual question answering. Our language
guidance improves the performance of CLIP by 7.6% and BLIP-2 by 4.8% in the challenging A-OKVQA
dataset. We also observe consistent improvement in performance on the Science-QA, VSR, and IconQA
datasets when using the proposed language guidance. The implementation of LG-VQA is publicly
available at https://github.com/declare-lab/LG-VQA.
Language model unalignment: Parametric red-teaming to expose hidden harms and biases
Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language
Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent
disregarding the harmfulness of the query. Existing methods are primarily based on input text-based
red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to
condition the model in a way to bypass its safe behavior. Bypassing the guardrails uncovers hidden
harmful information and biases in the model that are left untreated or newly introduced by its
safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low
attack success rate and limited applicability to specific models. In this paper, we present a new
perspective on LLM safety research i.e., parametric red-teaming through Unalignment. It simply
(instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the
model's behavior. Unalignment using as few as 100 examples can significantly bypass safety-aligned
models such as CHATGPT, to the point where it responds with an 88% success rate to harmful queries
on two safety benchmark datasets. On open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B and
13B, it shows an attack success rate of more than 91%. On bias evaluations, Unalignment exposes
inherent biases in safety-aligned models such as CHATGPT and LLAMA-2-CHAT, where the model's
responses are strongly biased and opinionated 64% of the time.
LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models
★
The success of large language models (LLMs), like GPT-4 and ChatGPT, has led to the development of
numerous cost-effective and accessible alternatives that are created by finetuning open-access LLMs
with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various
fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the
most attractive topics, as it only requires fine-tuning a few external parameters instead of the
entire LLMs while achieving comparable or even better performance. To enable further research on
PEFT methods of LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates
various adapters into LLMs and can execute these adapter-based PEFT methods of LLMs for different
tasks. The framework includes state-of-the-art open-access LLMs such as LLaMA, BLOOM, and GPT-J, as
well as widely used adapters such as Series adapters, Parallel adapters, Prompt-based learning, and
Reparametrization-based methods. Moreover, we conduct extensive empirical studies on the impact of
adapter types, placement locations, and hyper-parameters to find the best design for each
adapter-based method. We evaluate the effectiveness of the adapters on fourteen datasets from two different
reasoning tasks, Arithmetic Reasoning and Commonsense Reasoning. The results demonstrate that using
adapter-based PEFT in smaller-scale LLMs (7B) with few extra trainable parameters yields comparable,
and in some cases superior, performance to powerful LLMs (175B) in zero-shot inference on simple
math reasoning datasets.
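A series (bottleneck) adapter of the kind such frameworks integrate can be sketched in plain Python. The backbone stays frozen and only the two small projections would be trained; the dimensions and initialization scale here are arbitrary illustrative choices:

```python
import random

random.seed(0)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class SeriesAdapter:
    """Minimal bottleneck (series) adapter: down-project, nonlinearity,
    up-project, residual connection. Only these two small matrices are
    trainable; with small initialization the module starts near-identity."""

    def __init__(self, d_model, d_bottleneck, scale=0.01):
        self.W_down = [[random.uniform(-scale, scale) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        self.W_up = [[random.uniform(-scale, scale) for _ in range(d_bottleneck)]
                     for _ in range(d_model)]

    def __call__(self, hidden):
        z = [max(0.0, v) for v in matvec(self.W_down, hidden)]           # ReLU
        return [h + u for h, u in zip(hidden, matvec(self.W_up, z))]     # residual

adapter = SeriesAdapter(d_model=8, d_bottleneck=2)
out = adapter([1.0] * 8)
```

With d_model = 8 and a bottleneck of 2, the adapter adds only 2 * 8 * 2 = 32 parameters per layer, which is where the parameter efficiency comes from.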
MM-BigBench: Evaluating multimodal models on multimodal content comprehension tasks
The popularity of multimodal large language models (MLLMs) has triggered a recent surge in
research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of
MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting
performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond
multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound
understanding of multimodal contexts, achieved through the multimodal interaction to obtain a final
answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which
incorporates a diverse range of metrics to offer an extensive evaluation of the performance of
various models and instructions across a wide spectrum of diverse multimodal content comprehension
tasks. Consequently, our work complements research on the performance of MLLMs in multimodal
comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we
employ the Best Performance metric to ascertain each model's performance upper bound on
different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall
performance of various models and instructions, while the Stability metric measures their
sensitivity. Furthermore, previous research centers on evaluating models independently or solely
assessing instructions, neglecting the adaptability between models and instructions. We propose the
Adaptability metric to quantify the adaptability between models and instructions. Our paper
evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with
10 instructions for each task, and derives novel insights. Our code will be released at this https
URL.
Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions
★
Sentiment analysis (SA) has gained much traction in the fields of artificial intelligence (AI) and
natural language processing (NLP). There is growing demand to automate analysis of user sentiment
towards products or services. Opinions are increasingly being shared online in the form of videos
rather than text alone. This has led to SA using multiple modalities, termed Multimodal Sentiment
Analysis (MSA), becoming an important research area. MSA utilises the latest advancements in machine
learning and deep learning at various stages, including multi-modal feature extraction, fusion, and
sentiment polarity detection, with the aim of minimizing error rates and improving performance. This
survey paper examines primary taxonomy and newly released multimodal fusion architectures. Recent
developments in MSA architectures are divided into ten categories, namely early fusion, late fusion,
hybrid fusion, model-level fusion, tensor fusion, hierarchical fusion, bi-modal fusion,
attention-based fusion, quantum-based fusion and word-level fusion. A comparison of several
architectural evolutions in terms of MSA fusion categories and their relative strengths and
limitations is presented. Finally, a number of interdisciplinary applications and future research
directions are proposed.
Mustango: Toward controllable text-to-music generation
★
The quality of the text-to-music models has reached new heights due to recent advancements in
diffusion models. The controllability of various musical aspects, however, has barely been explored.
In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on
diffusion. Mustango aims to control the generated music not only with general text captions, but
also with richer captions that can include specific instructions related to chords, beats, tempo, and
key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that
steers the generated music to include the music-specific conditions, which we predict from the text
prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the
limited availability of open datasets of music with text captions, we propose a novel data
augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music
audio and using state-of-the-art Music Information Retrieval methods to extract the music features
which will then be appended to the existing descriptions in text format. We release the resulting
MusicBench dataset which contains over 52K instances and includes music-theory-based descriptions in
the caption text. Through extensive experiments, we show that the quality of the music generated by
Mustango is state-of-the-art, and the controllability through music-specific text prompts greatly
outperforms other models such as MusicGen and AudioLDM2.
Red-teaming large language models using chain of utterances for safety-alignment
★
Large language models (LLMs) have taken the world by storm with their massive
multi-tasking capabilities simply by optimizing over a next-word prediction objective. With the
emergence of their properties and encoded knowledge, the risk of LLMs producing harmful outputs
increases, making them unfit for scalable deployment for the public. In this work, we propose a new
safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that even widely deployed
models are susceptible to Chain of Utterances-based (CoU) prompting, jailbreaking closed-source
LLM-based systems such as GPT-4 and ChatGPT into unethically responding to more than 65% and 73% of
harmful queries. We also demonstrate the consistency of RED-EVAL across 8 open-source LLMs in
generating harmful responses in more than 86% of the red-teaming attempts. Next, we propose
RED-INSTRUCT--An approach for the safety alignment of LLMs. It constitutes two phases: 1) HARMFULQA
data collection: Leveraging CoU prompting, we collect a dataset that consists of 1.9K harmful
questions covering a wide range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2)
SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the safety alignment of
LLMs by minimizing the negative log-likelihood over helpful responses and penalizing harmful
responses via gradient ascent over sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is
observed to be more safely aligned when evaluated on RED-EVAL and HHH benchmarks while preserving
the utility of the baseline models (TruthfulQA, MMLU, and BBH).
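The SAFE-ALIGN objective, descending on the helpful-response loss while ascending on the harmful-response loss, can be sketched as follows. The weighting `alpha`, the toy token probabilities, and the exact functional form are illustrative assumptions, not the paper's formulation:

```python
import math

def nll(token_probs):
    """Mean negative log-likelihood of a response, given the probabilities
    the model assigned to each reference token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def safe_align_loss(helpful_token_probs, harmful_token_probs, alpha=0.1):
    """Descend on the helpful-response NLL while ascending on the
    harmful-response NLL (the sign-flipped second term)."""
    return nll(helpful_token_probs) - alpha * nll(harmful_token_probs)

# Toy numbers: the model currently assigns high probability to both a
# helpful and a harmful reference response.
helpful = [0.9, 0.8, 0.95]
harmful = [0.9, 0.85, 0.9]
loss = safe_align_loss(helpful, harmful)
```

Because the harmful term enters with a negative sign, assigning the harmful response higher likelihood increases the loss, so gradient descent pushes that likelihood down while still fitting the helpful response.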
Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding
We introduce SEGUE, a sentence-embedder-guided utterance encoder for spoken language understanding
that leverages pretrained sentence embeddings to improve utterance representations. SEGUE adapts
sentence-level context into utterance encoding, improving slot filling and intent detection on
conversational speech benchmarks while being parameter-efficient.
Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
★
The immense scale of recent large language models (LLMs) enables many interesting properties,
such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero-
and few-shot performance in many natural language processing (NLP) tasks. Inspired by such
successes, we adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio
(TTA) generation -- a task where the goal is to generate audio from its textual description. The
prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned
model, such as T5. Consequently, our latent diffusion model (LDM)-based approach TANGO outperforms
the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on AudioCaps test
set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen.
This improvement might also be attributed to the adoption of audio pressure level-based sound mixing
for training set augmentation, whereas the prior methods take a random mix.
UDApter: Efficient Domain Adaptation Using Adapters
We propose two methods to make unsupervised domain adaptation (UDA) more parameter efficient using
adapters – small bottleneck layers interspersed with every layer of the large-scale pre-trained
language model (PLM). The first method deconstructs UDA into a two-step process: first by adding a
domain adapter to learn domain-invariant information and then by adding a task adapter that uses
domain-invariant information to learn task representations in the source domain. The second method
jointly learns a supervised classifier while reducing the divergence measure. Compared to strong
baselines, our simple methods perform well in natural language inference (MNLI) and the cross-domain
sentiment classification task. We even outperform unsupervised domain adaptation methods such as
DANN and DSN in sentiment classification, and we are within 0.85% F1 on the natural language
inference task, while fine-tuning only a fraction of the full model parameters. We release our code at this URL.
Uncertainty Guided Label Denoising for Document-level Distant Relation Extraction
Document-level relation extraction (DocRE) aims to infer complex semantic relations among entities
in a document. Distant supervision (DS) is able to generate massive auto-labeled data, which can
improve DocRE performance. Recent works leverage pseudo labels generated by the pre-denoising model
to reduce noise in DS data. However, unreliable pseudo labels bring new noise, e.g., adding false
pseudo labels and losing correct DS labels. Therefore, how to select effective pseudo labels to
denoise DS data is still a challenge in document-level distant relation extraction. To tackle this
issue, we introduce uncertainty estimation techniques to determine whether pseudo labels can be
trusted. In this work, we propose a Document-level distant Relation Extraction framework with
Uncertainty Guided label denoising, UGDRE. Specifically, we propose a novel instance-level
uncertainty estimation method, which measures the reliability of the pseudo labels with overlapping
relations. By further considering the long-tail problem, we design dynamic uncertainty thresholds
for different types of relations to filter high-uncertainty pseudo labels. We conduct experiments on
two public datasets. Our framework outperforms strong baselines by 1.91 F1 and 2.28 Ign F1 on the
RE-DocRED dataset.
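The dynamic-threshold filtering can be sketched with a per-relation-type quantile rule; the quantile stand-in and toy uncertainty values below are assumptions, as the paper derives its thresholds from instance-level uncertainty estimates:

```python
def filter_pseudo_labels(pseudo_labels, quantile=0.8):
    """Keep only pseudo labels whose uncertainty falls below a dynamic,
    per-relation-type threshold (here, a per-type quantile), so frequent
    and long-tail relation types are each filtered on their own scale.

    pseudo_labels : list of (relation_type, uncertainty) pairs
    """
    by_type = {}
    for rel, u in pseudo_labels:
        by_type.setdefault(rel, []).append(u)

    thresholds = {}
    for rel, us in by_type.items():
        us = sorted(us)
        thresholds[rel] = us[int(quantile * (len(us) - 1))]

    return [(rel, u) for rel, u in pseudo_labels if u <= thresholds[rel]]

# Hypothetical pseudo labels with per-instance uncertainty scores.
labels = [("born_in", 0.1), ("born_in", 0.2), ("born_in", 0.9),
          ("spouse", 0.4), ("spouse", 0.5)]
kept = filter_pseudo_labels(labels, quantile=0.8)
```

Each relation type gets its own cutoff, so a long-tail type with generally higher uncertainty is not wiped out by a single global threshold.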
WikiDes: A Wikipedia-based dataset for generating short descriptions from paragraphs
As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to
many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base
building, machine translation, text classification, and text summarization. In this paper, we
introduce WikiDes, a novel dataset to generate short descriptions of Wikipedia articles for the
problem of text summarization. The dataset consists of over 80k English samples on 6987 topics. We
set up a two-phase summarization method - description generation (Phase I) and candidate ranking
(Phase II) - as a strong approach that relies on transfer and contrastive learning. For description
generation, T5 and BART show their superiority compared to other small-scale pre-trained models. By
applying contrastive learning with the diverse input from beam search, the metric fusion-based
ranking models significantly outperform the direct description generation models, by up to 22 ROUGE
points, in both the topic-exclusive and topic-independent splits. Furthermore, in human evaluation
against the gold descriptions, the Phase II descriptions are chosen 45.33% of the time, compared to
23.66% for Phase I. Regarding sentiment analysis, the generated descriptions cannot capture all
sentiment polarities from the paragraphs as effectively as the gold descriptions do. The automatic
generation of new descriptions reduces the human effort in
creating them and enriches Wikidata-based knowledge graphs. Our paper shows a practical impact on
Wikipedia and Wikidata since there are thousands of missing descriptions. Finally, we expect WikiDes
to be a useful dataset for related works in capturing salient information from short paragraphs. The
curated dataset is publicly available at: https://github.com/declare-lab/WikiDes.
2022
A dataset for hyper-relational extraction and a cube-filling approach
Relation extraction has the potential for large-scale knowledge graph construction, but current
methods do not consider the qualifier attributes for each relation triplet, such as time, quantity
or location. The qualifiers form hyper-relational facts which better capture the rich and complex
knowledge graph structure. For example, the relation triplet (Leonard Parker, Educated At, Harvard
University) can be factually enriched by including the qualifier (End Time, 1967). Hence, we propose
the task of hyper-relational extraction to extract more specific and complete facts from text. To
support the task, we construct HyperRED, a large-scale and general-purpose dataset. Existing models
cannot perform hyper-relational extraction as it requires a model to consider the interaction
between three entities. Hence, we propose CubeRE, a cube-filling model inspired by table-filling
approaches that explicitly considers the interaction between relation triplets and qualifiers. To
improve model scalability and reduce negative class imbalance, we further propose a cube-pruning
method. Our experiments show that CubeRE outperforms strong baselines and reveal possible directions
for future research. Our code and data are available at github.com/declare-lab/HyperRED.
Analyzing Modality Robustness in Multimodal Sentiment Analysis
★
Building robust multimodal models is crucial for achieving reliable deployment in the wild.
Despite its importance, less attention has been paid to identifying and improving the robustness of
Multimodal Sentiment Analysis (MSA) models. In this work, we hope to address that by (i) Proposing
simple diagnostic checks for modality robustness in a trained multimodal model. Using these checks,
we find MSA models to be highly sensitive to a single modality, which creates issues in their
robustness; (ii) We analyze well-known robust training strategies to alleviate the issues.
Critically, we observe that robustness can be achieved without compromising on the original
performance. We hope our extensive study–performed across five models and two benchmark datasets–and
proposed procedures would make robustness an integral component in MSA research. Our diagnostic
checks and robust training solutions are simple to implement and available at
https://github.com/declare-lab/MSA-Robustness
CICERO: A Dataset for Contextualized Commonsense Inference in Dialogues
★
This paper addresses the problem of dialogue reasoning with contextualized commonsense inference.
We curate CICERO, a dataset of dyadic conversations with five types of utterance-level
reasoning-based inferences: cause, subsequent event, prerequisite, motivation, and emotional
reaction. The dataset contains 53,105 such inferences from 5,672 dialogues. We use this dataset
to solve relevant generative and discriminative tasks: generation of cause and subsequent event;
generation of prerequisite, motivation, and listener’s emotional reaction; and selection of
plausible alternatives. Our results ascertain the value of such dialogue-centric commonsense
knowledge datasets. It is our hope that CICERO will open new research avenues into commonsense-based
dialogue reasoning.
DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification
This paper proposes a simple yet effective interpolation-based data augmentation approach, termed
DoubleMix, to improve the robustness of models in text classification. DoubleMix first leverages a
couple of simple augmentation operations to generate several perturbed samples for each training
example, and then uses the perturbed and original data to carry out a two-step interpolation in
the hidden space of neural models. Concretely, it first mixes up the perturbed data into a synthetic
sample and then mixes up the original data and the synthetic perturbed data. DoubleMix enhances
models’ robustness by learning the “shifted” features in hidden space. On six text classification
benchmark datasets, our approach outperforms several popular text augmentation methods including
token-level, sentence-level, and hidden-level data augmentation techniques. Also, experiments in
low-resource settings show our approach consistently improves models’ performance when the training
data is scarce. Extensive ablation studies and case studies confirm that each component of our
approach contributes to the final performance and show that our approach exhibits superior
performance on challenging counterexamples. Additionally, visual analysis shows that text features
generated by our approach are highly interpretable.
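The two-step interpolation described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the released implementation: the function and parameter names are hypothetical, and the real DoubleMix interpolates hidden representations inside a trained text classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_mix(h_orig, h_perturbed, alpha=1.0, beta=1.0):
    """Two-step interpolation in hidden space (illustrative sketch).

    h_orig:      hidden vector of the original sample, shape (d,)
    h_perturbed: hidden vectors of k augmented views, shape (k, d)
    """
    # Step 1: mix the perturbed views into one synthetic sample.
    w = rng.dirichlet(alpha * np.ones(len(h_perturbed)))
    h_synth = w @ h_perturbed
    # Step 2: mix the original sample with the synthetic one, keeping
    # the original dominant so its label still applies.
    lam = rng.beta(beta, beta)
    lam = max(lam, 1.0 - lam)
    return lam * h_orig + (1.0 - lam) * h_synth

h = np.ones(4)                                   # stand-in hidden vector
views = np.stack([h + 0.1, h - 0.1, h + 0.05])   # "perturbed" views
mixed = double_mix(h, views)
```

Keeping the mixing coefficient above 0.5 ensures the mixed feature stays closer to the original sample, which is why the original label can still be used for the "shifted" feature.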
Exemplars-guided empathetic response generation controlled by the elements of human communication
Empathy is fundamental to humans, among other animals. It is key to strengthening social cohesion,
a cornerstone of the health and success of societies. Thus, empathy could be an important component of
effective human-computer interactions through conversations. This has motivated a whole sub-field of
research focused on empathetic response generation. The majority of existing methods for empathetic
response generation rely on the emotion of the context to generate empathetic responses. However,
empathy is much more than generating responses with an appropriate emotion. It also often entails
subtle expressions of understanding and personal resonance with the situation of the other
interlocutor. Unfortunately, such qualities are difficult to quantify, and the datasets lack
relevant annotations. To address this issue, in this paper we propose an approach that relies on
exemplars to cue the generative model on fine stylistic properties that signal empathy to the
interlocutor. To this end, we employ dense passage retrieval to extract relevant exemplary responses
from the training set. Three elements of human communication—emotional presence, interpretation, and
exploration—and sentiment are additionally introduced using synthetic labels to guide the generation
towards empathy. The human evaluation is also extended by these elements of human communication. We
empirically show that these approaches yield significant improvements in empathetic response quality
in terms of both automated and human-evaluated metrics. The implementation is available at
https://github.com/declare-lab/exemplary-empathy.
Improving aspect-level sentiment analysis with aspect extraction
★
Aspect-based sentiment analysis (ABSA), a popular research area in NLP, has two distinct
parts—aspect extraction (AE) and labelling the aspects with sentiment polarity (ALSA). Although
distinct, these two tasks are highly correlated. The work primarily hypothesizes that transferring
knowledge from a pre-trained AE model can benefit the performance of ALSA models. Based on this
hypothesis, word embeddings are obtained during AE and subsequently fed to the ALSA model.
Empirically, this work shows that the added information significantly improves the performance of
three different baseline ALSA models on two distinct domains. This improvement also translates well
across domains between AE and ALSA tasks.
Improving zero-shot learning baselines with commonsense knowledge
Zero-shot learning — the problem of training and testing on a completely disjoint set of classes —
relies greatly on its ability to transfer knowledge from train classes to test classes.
Traditionally semantic embeddings consisting of human-defined attributes or distributed word
embeddings are used to facilitate this transfer by improving the association between visual and
semantic embeddings. In this paper, we take advantage of explicit relations between nodes defined in
ConceptNet, a commonsense knowledge graph, to generate commonsense embeddings of the class labels by
using a graph convolution network-based autoencoder. Our experiments performed on three standard
benchmark datasets surpass the strong baselines when we fuse our commonsense embeddings with
existing semantic embeddings, i.e., human-defined attributes and distributed word embeddings. This
work paves the way to more brain-inspired approaches to zero-shot learning.
KNOT: Knowledge Distillation using Optimal Transport for Solving NLP Tasks
We propose a new approach, Knowledge Distillation using Optimal Transport (KNOT), to distill the
natural language semantic knowledge from multiple teacher networks to a student network. KNOT aims
to train a (global) student model by learning to minimize the optimal transport cost of its assigned
probability distribution over the labels to the weighted sum of probabilities predicted by the
(local) teacher models, under the constraints that the student model does not have access to teacher
models’ parameters or training data. To evaluate the quality of knowledge transfer, we introduce a
new metric, Semantic Distance (SD), that measures semantic closeness between the predicted and
ground truth label distributions. The proposed method shows improvements in the global model’s SD
performance over the baseline across three NLP tasks while performing on par with Entropy-based
distillation on standard accuracy and F1 metrics. The implementation pertaining to this work is
publicly available at https://github.com/declare-lab/KNOT.
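The optimal transport cost at the heart of KNOT is typically approximated with a few Sinkhorn iterations. The sketch below is a generic entropy-regularized OT computation on toy label distributions; the cost matrix, distributions, and function name are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def sinkhorn_cost(p, q, C, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport cost between two label
    distributions p and q, given a ground cost matrix C (Sinkhorn)."""
    K = np.exp(-C / reg)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    T = u[:, None] * K * v[None, :]   # transport plan with marginals p, q
    return float(np.sum(T * C))

# Student vs. weighted-teacher distributions over 3 labels, with a
# toy 0/1 cost for moving probability mass between different labels.
student = np.array([0.7, 0.2, 0.1])
teachers = np.array([0.6, 0.3, 0.1])
C = 1.0 - np.eye(3)
loss = sinkhorn_cost(student, teachers, C)
```

In KNOT the ground cost would encode semantic distance between labels, so the loss penalizes mass moved between semantically distant labels more than between close ones.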
Knowledge enhanced reflection generation for counseling dialogues
In this paper, we study the effect of commonsense and domain knowledge while generating responses
in counseling conversations using retrieval and generative methods for knowledge integration. We
propose a pipeline that collects domain knowledge through web mining, and show that retrieval from
both domain-specific and commonsense knowledge bases improves the quality of generated responses. We
also present a model that incorporates knowledge generated by COMET using soft positional encoding
and masked self-attention. We show that both retrieved and COMET-generated knowledge improve the
system’s performance as measured by automatic metrics and also by human evaluation. Lastly, we
present a comparative study on the types of knowledge encoded by our system showing that causal and
intentional relationships benefit the generation task more than other types of commonsense
relations.
MM-Align: Learning Optimal Transport-based Alignment Dynamics for Fast and Accurate Inference on Missing Modality Sequences
Existing multimodal tasks mostly assume a complete input modality setting, i.e., each modality
is either complete or entirely missing in both training and test sets. However, randomly
missing modalities remain underexplored. In this paper, we present a novel approach named
MM-Align to address the missing-modality inference problem. Concretely, we propose 1) an alignment
dynamics learning module based on the theory of optimal transport (OT) for missing data imputation;
2) a denoising training algorithm to enhance the quality of imputation as well as the accuracy of
model predictions. Compared with previous generative methods, which aim to restore the missing
inputs, MM-Align learns to capture and imitate the alignment dynamics between modality sequences.
Results of comprehensive experiments on two multimodal tasks empirically demonstrate that our method
can perform more accurate and faster inference and alleviate the overfitting issue under different
missing conditions.
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
★
Deep Learning and its applications have catalyzed impactful research across modalities. This survey
presents a detailed overview of current trends in vision-and-language research, covering task
formulations, solution approaches, evaluation strategies, and emerging challenges. We highlight
multi-disciplinary patterns and outline directions toward more modular and transparent multimodal
systems.
Multiview contextual commonsense inference: A new dataset and task
Contextual commonsense inference is the task of generating various types of explanations
around the events in a dyadic dialogue, including cause, motivation, emotional reaction, and others.
Producing a coherent and non-trivial explanation requires awareness of the dialogue's structure
and of how an event is grounded in the context. In this work, we create CICEROv2, a dataset
consisting of 8,351 instances from 2,379 dialogues, containing multiple human-written answers for
each contextual commonsense inference question, representing a type of explanation on cause,
subsequent event, motivation, and emotional reaction. We show that the inferences in CICEROv2 are
more semantically diverse than other contextual commonsense inference datasets. To solve the
inference task, we propose a collection of pre-training objectives, including concept denoising and
utterance sorting to prepare a pre-trained model for the downstream contextual commonsense inference
task. Our results show that the proposed pre-training objectives are effective at adapting the
pre-trained T5-Large model for the contextual commonsense inference task.
PIP: Physical Interaction Prediction via mental simulation with span selection
Accurate prediction of physical interaction outcomes is a crucial component of human intelligence
and is important for safe and efficient deployments of robots in the real world. While there are
existing vision-based intuitive physics models that learn to predict physical interaction outcomes,
they mostly focus on generating short sequences of future frames based on physical properties (e.g.
mass, friction and velocity) extracted from visual inputs or a latent space. However, there is a
lack of intuitive physics models that are tested on long physical interaction sequences with
multiple interactions among different objects. We hypothesize that selective temporal attention
during approximate mental simulations helps humans in physical interaction outcome prediction. With
these motivations, we propose a novel scheme: Physical Interaction Prediction via Mental Simulation
with Span Selection (PIP). It utilizes a deep generative model to model approximate mental
simulations by generating future frames of physical interactions before employing selective temporal
attention in the form of span selection for predicting physical interaction outcomes. To evaluate
our model, we further propose the large-scale SPACE+ dataset of synthetic videos with long sequences
of three prime physical interactions in a 3D environment. Our experiments show that PIP outperforms
human, baseline, and related intuitive physics models that utilize mental simulation. Furthermore,
PIP's span selection module effectively identifies the frames indicating key physical interactions
among objects, allowing for added interpretability.
Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)
RelationPrompt: Leveraging Prompts to Generate Synthetic Data for Zero-Shot Relation Triplet Extraction
★
Despite the importance of relation extraction in building and representing knowledge, less research
is focused on generalizing to unseen relation types. We introduce the task setting of Zero-Shot
Relation Triplet Extraction (ZeroRTE) to encourage further research in low-resource relation
extraction methods. Given an input sentence, each extracted triplet consists of the head entity,
relation label, and tail entity where the relation label is not seen at the training stage. To solve
ZeroRTE, we propose to synthesize relation examples by prompting language models to generate
structured texts. Concretely, we unify language model prompts and structured text approaches to
design a structured prompt template for generating synthetic relation samples when conditioning on
relation label prompts (RelationPrompt). To overcome the limitation of extracting multiple relation
triplets in a sentence, we design a novel Triplet Search Decoding method. Experiments on FewRel and
Wiki-ZSL datasets show the efficacy of RelationPrompt for the ZeroRTE task and zero-shot relation
classification. Our code and data are available at github.com/declare-lab/RelationPrompt.
SANCL: Multimodal Review Helpfulness Prediction with Selective Attention and Natural Contrastive Learning
With the boom of e-commerce, Multimodal Review Helpfulness Prediction (MRHP) that identifies the
helpfulness score of multimodal product reviews has become a research hotspot. Previous work on this
task focuses on attention-based modality fusion, information integration, and relation modeling,
which exposes the following drawbacks: 1) the model may fail to capture the truly
essential information due to its indiscriminate attention formulation; 2) it lacks appropriate modeling
methods that take full advantage of correlations in the provided data. In this paper, we propose
SANCL: Selective Attention and Natural Contrastive Learning for MRHP. SANCL adopts a probe-based
strategy to enforce high attention weights on the regions of greater significance. It also
constructs a contrastive learning framework based on natural matching properties in the dataset.
Experimental results on two benchmark datasets with three categories show that SANCL achieves
state-of-the-art performance with lower memory consumption.
SAT: Improving Semi-Supervised Text Classification with Simple Instance-Adaptive Self-Training
Self-training methods have been explored in recent years and have exhibited great performance in
improving semi-supervised learning. This work presents a simple instance-adaptive self-training
method (SAT) for semi-supervised text classification. SAT first generates two augmented views for
each unlabeled data, and then trains a meta learner to automatically identify the relative strength
of augmentations based on the similarity between the original view and the augmented views. The
weakly-augmented view is fed to the model to produce a pseudo-label and the strongly-augmented view
is used to train the model to predict the same pseudo-label. We conducted extensive experiments and
analyses on three text classification datasets and found that with varying sizes of labeled training
data, SAT consistently shows competitive performance compared to existing semi-supervised learning
methods.
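The weak-view/strong-view step described above can be sketched as follows. Note this shows only the generic pseudo-labeling core shared with FixMatch-style methods; SAT's distinctive meta learner, which adaptively decides each instance's augmentation strength, is omitted, and all names and the confidence threshold are illustrative.

```python
import numpy as np

def pseudo_label_loss(probs_weak, probs_strong, threshold=0.9):
    """Pseudo-label unlabeled data from the weakly augmented view, then
    train the strongly augmented view to predict the same label."""
    pseudo = probs_weak.argmax(axis=1)               # hard pseudo-labels
    confident = probs_weak.max(axis=1) >= threshold  # keep confident rows only
    ce = -np.log(probs_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    return float(ce[confident].mean()) if confident.any() else 0.0

weak = np.array([[0.95, 0.05], [0.55, 0.45]])    # only row 0 is confident
strong = np.array([[0.80, 0.20], [0.50, 0.50]])
loss = pseudo_label_loss(weak, strong)
```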
So Different Yet So Alike! Constrained Unsupervised Text Style Transfer
Automatic transfer of text between domains has become popular in recent times. One of its aims is
to preserve the semantic content while adapting to the target domain. However, it does not
explicitly maintain other attributes between the source and translated text: e.g., text length and
descriptiveness. Maintaining constraints in transfer has several downstream applications, including
data augmentation and debiasing. We introduce a method for such constrained unsupervised text style
transfer by introducing two complementary losses to the generative adversarial network (GAN) family
of models. Unlike the competing losses used in GANs, we introduce cooperative losses where the
discriminator and the generator cooperate and reduce the same loss. The first is a contrastive loss
and the second is a classification loss — aiming to regularize the latent space further and bring
similar sentences closer together. We demonstrate that such training retains lexical, syntactic and
domain-specific constraints between domains for multiple benchmark datasets, including ones where
more than one attribute change. We show that the complementary cooperative losses improve text
quality, according to both automated and human evaluation measures.
Towards solving NLP tasks with optimal transport loss
We explore the use of optimal transport-based loss functions for solving NLP tasks, framing label
alignment and distributional matching as transport problems. The proposed loss improves model
calibration and helps in tasks where alignment between predicted and target distributions matters,
demonstrating gains across several sequence labeling and classification benchmarks.
Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering
We propose a simple refactoring of multi-choice question answering (MCQA) tasks as a series of
binary classifications. The MCQA task is generally performed by scoring each (question, answer) pair
normalized over all the pairs, and then selecting the answer from the pair that yields the highest
score. For n answer choices, this is equivalent to an n-class classification setup where only one
class (true answer) is correct. We instead show that classifying (question, true answer) as positive
instances and (question, false answer) as negative instances is significantly more effective across
various models and datasets. We show the efficacy of our proposed approach in different tasks –
abductive reasoning, commonsense question answering, science question answering, and sentence
completion. Our DeBERTa binary classification model reaches the top or close to the top performance
on public leaderboards for these tasks. The source code of the proposed approach is available at
https://github.com/declare-lab/TEAM.
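The refactoring described above turns each n-choice question into n independent binary training instances, then picks the highest-scoring pair at inference. A minimal sketch, with a toy word-overlap scorer standing in for the paper's DeBERTa classifier (all names here are illustrative):

```python
def binarize_mcqa(question, choices, answer_idx):
    """One n-choice question -> n binary (pair, label) training instances."""
    return [((question, c), int(i == answer_idx)) for i, c in enumerate(choices)]

def predict(score_fn, question, choices):
    """At inference, pick the choice whose pair scores highest as 'positive'."""
    scores = [score_fn(question, c) for c in choices]
    return scores.index(max(scores))

# Toy scorer: word overlap with the question stands in for a trained classifier.
def toy_score(q, a):
    return len(set(q.lower().split()) & set(a.lower().split()))

question = "What color is the sky"
choices = ["The sky is blue", "Grass", "Seven"]
instances = binarize_mcqa(question, choices, 0)
pred = predict(toy_score, question, choices)
```

The key difference from the usual n-class setup is in training: each (question, answer) pair is classified independently as match/mismatch rather than softmax-normalized over all choices.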
Vector-Quantized Input-Contextualized Soft Prompts for Natural Language Understanding
Prompt Tuning has been largely successful as a parameter-efficient method of conditioning
large-scale pre-trained language models to perform downstream tasks. Thus far, soft prompt tuning
learns a fixed set of task-specific continuous vectors, i.e., soft tokens that remain static across
the task samples. A fixed prompt, however, may not generalize well to the diverse kinds of inputs
the task comprises. In order to address this, we propose Vector-quantized Input-contextualized
Prompts (VIP) as an extension to the soft prompt tuning framework. VIP particularly focuses on two
aspects—contextual prompts that learn input-specific contextualization of the soft prompt tokens
through a small-scale sentence encoder, and quantized prompts that map the contextualized prompts to
a set of learnable codebook vectors through a Vector quantization network. On various language
understanding tasks like SuperGLUE, QA, Relation classification, NER and NLI, VIP outperforms the
soft prompt tuning (PT) baseline by an average margin of 1.19%. Further, our generalization studies
show that VIP learns more robust prompt representations, surpassing PT by margins of 0.6%-5.3% on
out-of-domain QA and NLI tasks respectively, and by 0.75% on a multi-task setup over 4 tasks spanning
12 domains.
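The quantization step described above maps each contextualized prompt token to its nearest entry in a learnable codebook. A minimal nearest-neighbour lookup sketch follows; the shapes and names are illustrative assumptions, and the straight-through gradient trick needed to actually train such a codebook is omitted.

```python
import numpy as np

def quantize_prompts(contextual_prompts, codebook):
    """Map each contextualized prompt token to its nearest codebook
    vector (nearest-neighbour lookup only; training tricks omitted)."""
    # pairwise squared distances: (n_tokens, n_codes)
    d = ((contextual_prompts[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)
    return codebook[codes], codes

rng = np.random.default_rng(0)
prompts = rng.normal(size=(5, 8))    # 5 contextualized soft-prompt tokens
codebook = rng.normal(size=(16, 8))  # 16 learnable codebook vectors
quantized, codes = quantize_prompts(prompts, codebook)
```

Quantizing to a shared codebook is what regularizes the prompts: many different inputs are forced through the same small set of discrete codes.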
2021
Affect recognition for multimodal natural language processing
Language is inherently multimodal. It has many forms of appearance, like speech, gestures, facial
expressions, and head-nods. In an ideal human-machine conversational system, machines should
understand this multimodal language. Understanding human language also largely depends on the
machines’ ability to interpret emotions. Emotional sensitivity can prevent desultory answers
provided by these machines, thus making conversations more natural and engaging. For us humans,
emotions aid our learning, communication, and decision-making. Hence, over the past two decades,
there has been a significant effort to incorporate cognitive capabilities into machines so that they
can interpret, comprehend and express emotions. Computational analysis of human multimodal language
is an emerging research area in natural language processing (NLP). It expands the horizons of NLP to
study language used in face to face communication and in online multimedia. This form of language
contains the language modality (in terms of spoken text), the visual modality (in terms of gestures and facial
expressions), and the acoustic modality (in terms of changes in voice tone). At its core, this research area
is focused on modeling the three modalities and their complex interactions. This special issue on
Affect Recognition in Human Multimodal Language aims to facilitate the growth of this new research
direction in the community. The challenges of modeling human multimodal language can be split into
two major categories: (1) studying each modality individually and modeling each in a manner that can
be linked to other modalities (also known as intramodal dynamics), and (2) linking the modalities by
modeling the interactions between them (also known as intermodal dynamics). Common forms of these
interactions include complementary or correlated information across modes. Intrinsic to each
modality, modeling human multimodal language is complex due to factors such as idiosyncrasy in
communicative styles, non-trivial alignment between modalities and unreliable or contradictory
information across modalities. Therefore, computational analysis of multimodal language becomes a
challenging research area. This special issue aimed at bringing together contributions from both
academics and practitioners in the context of affect recognition in multimodal language with a
primary focus on: (1) multimodal affect recognition in monologues, (2) effective multimodal fusion,
and (3) detecting affect in dyadic and multiparty multimodal conversations. Detecting affect in
conversations is more challenging than in monologues, primarily due to the complex
inter-dependencies between speaker states in the conversation.
Aspect sentiment triplet extraction using reinforcement learning
★
Aspect Sentiment Triplet Extraction (ASTE) is the task of extracting triplets of aspect terms,
their associated sentiments, and the opinion terms that provide evidence for the expressed
sentiments. Previous approaches to ASTE usually simultaneously extract all three components or first
identify the aspect and opinion terms, then pair them up to predict their sentiment polarities. In
this work, we present a novel paradigm, ASTE-RL, by regarding the aspect and opinion terms as
arguments of the expressed sentiment in a hierarchical reinforcement learning (RL) framework. We
first focus on sentiments expressed in a sentence, then identify the target aspect and opinion terms
for that sentiment. This takes into account the mutual interactions among the triplet's components
while improving exploration and sample efficiency. Furthermore, this hierarchical RL setup enables
us to deal with multiple and overlapping triplets. In our experiments, we evaluate our model on
existing datasets from laptop and restaurant domains and show that it achieves state-of-the-art
performance. The implementation of this work is publicly available at
https://github.com/declare-lab/ASTE-RL.
Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis
★
Multimodal sentiment analysis aims to extract and integrate semantic information collected from
multiple modalities to recognize the expressed emotions and sentiment in multimodal data. This
research area’s major concern lies in developing an extraordinary fusion scheme that can extract and
integrate key information from various modalities. However, previous work is restricted by the lack
of leveraging dynamics of independence and correlation between modalities to reach top performance.
To mitigate this, we propose the Bi-Bimodal Fusion Network (BBFN), a novel end-to-end network that
performs fusion (relevance increment) and separation (difference increment) on pairwise modality
representations. The two parts are trained simultaneously such that the combat between them is
simulated. The model takes two bimodal pairs as input due to the known information imbalance among
modalities. In addition, we leverage a gated control mechanism in the Transformer architecture to
further improve the final output. Experimental results on three datasets (CMU-MOSI, CMU-MOSEI, and
UR-FUNNY) verify that our model significantly outperforms the SOTA. The implementation of this
work is available at https://github.com/declare-lab/multimodal-deep-learning and
https://github.com/declare-lab/BBFN.
Causal augmentation for causal sentence classification
Scarcity of annotated causal texts leads to poor robustness when training state-of-the-art language
models for causal sentence classification. In particular, we found that models misclassify on
augmented sentences that have been negated or strengthened with respect to their causal meaning. This
is worrying since minor linguistic differences in causal sentences can have disparate meanings.
Therefore, we propose the generation of counterfactual causal sentences by creating contrast sets
(Gardner et al., 2020) to be included during model training. We experimented on two model
architectures and predicted on two out-of-domain corpora. While our strengthening schemes proved
useful in improving model performance, for negation, regular edits were insufficient. Thus, we also
introduce heuristics like shortening or multiplying root words of a sentence. By including a mixture
of edits when training, we achieved performance improvements beyond the baseline across both models,
and both within and outside the corpus’ domain, suggesting that our proposed augmentation can also help models
generalize.
CIDER: Commonsense inference for dialogue explanation and reasoning
Commonsense inference to understand and explain human language is a fundamental research problem in
natural language processing. Explaining human conversations poses a great challenge as it requires
contextual understanding, planning, inference, and several aspects of reasoning including causal,
temporal, and commonsense reasoning. In this work, we introduce CIDER – a manually curated dataset
that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets
inferred using contextual commonsense inference. Extracting such rich explanations from
conversations can be conducive to improving several downstream applications. The annotated triplets
are categorized by the type of commonsense knowledge present (e.g., causal, conditional, temporal).
We set up three different tasks conditioned on the annotated dataset: Dialogue-level Natural
Language Inference, Span Extraction, and Multi-choice Span Selection. Baseline results obtained with
transformer-based models reveal that the tasks are difficult, paving the way for promising future
research. The dataset and the baseline implementations are publicly available at
https://github.com/declare-lab/CIDER .
Deep neural approaches to relation triplets extraction: a comprehensive survey
★
The task of relation extraction is about identifying entities and relations among them in free text
for the enrichment of structured knowledge bases (KBs). In this paper, we present a comprehensive
survey of this important research topic in natural language processing. Recently, with the advances
made in the continuous representation of words (word embeddings) and deep neural architectures, many
research works are published in the area of relation extraction. To help future research, we present
a comprehensive review of the recently published research works in relation extraction. Previous
surveys on this task covered only one aspect of relation extraction that is pipeline-based relation
extraction approaches at the sentence level. In this survey, we cover sentence-level relation
extraction to document-level relation extraction, pipeline-based approaches to joint extraction
approaches, and annotated datasets to distantly supervised datasets, along with a few very recent research
directions such as zero-shot and few-shot relation extraction and noise mitigation in distantly
supervised datasets. Regarding neural architectures, we cover convolutional models, recurrent
network models, attention network models, and graph convolutional models in this survey. We survey
more than 100 publications in the field of relation extraction and present them in a structured way
based on their similarity in the specific task they tried to solve, their model architecture, and the
datasets they used for experiments. We include the current state-of-the-art performance on several
datasets for comparison. We have covered different aspects of research
in the relation extraction field, with a key focus on recent deep neural network-based methods. Also, we
identify possible future research directions. Hopefully, this will help future researchers to
identify the current research gaps and take the field forward.
Discriminative dictionary design for action classification in still images and videos
In this paper, we address the problem of action recognition from still images and videos.
Traditional local features such as SIFT and STIP invariably pose two potential problems: 1) they are
not evenly distributed in different entities of a given category and 2) many of such features are
not exclusive of the visual concept the entities represent. In order to generate a dictionary taking
the aforementioned issues into account, we propose a novel discriminative method for identifying
robust and category specific local features which maximize the class separability to a greater
extent. Specifically, we pose the selection of potent local descriptors as filtering-based feature
selection problem, which ranks the local features per category based on a novel measure of
distinctiveness. The underlying visual entities are subsequently represented based on the learned
dictionary, and this stage is followed by action classification using the random forest model
followed by label propagation refinement. The framework is validated on the action recognition
datasets based on still images (Stanford-40) as well as videos (UCF-50). We get 51.2% and 66.7%
recognition accuracy for Stanford-40 and UCF-50, respectively. Compared to other representative
methods from the literature, our approach exhibits superior performance. This demonstrates the
effectiveness of the adaptive ranking methodology presented in this work.
DOZEN: Cross-domain zero shot named entity recognition with knowledge graph
With the new developments of natural language processing, increasing attention has been given to
the task of Named Entity Recognition (NER). However, the vast majority of work focuses on a small
number of large-scale annotated datasets with a limited number of entities such as person, location
and organization. While other datasets have been introduced with domain-specific entities, the
smaller size of these largely limits the applicability of state-of-the-art deep models. Even if
there are promising new approaches for performing zero-shot learning (ZSL), they are not designed
for cross-domain settings. We propose Cross Domain Zero Shot Named Entity Recognition with
Knowledge Graph (DOZEN), which learns the relations between entities across different domains from
an existing ontology of external knowledge and a set of analogies linking entities and domains.
Experiments performed on both large scale and domain-specific datasets indicate that DOZEN is the
most suitable option to extract unseen entities in a target dataset from a different domain.
Exploring the role of context in utterance-level emotion, act and intent classification in conversations: An empirical study
the user intention and background. In recent years, a number of context-aware approaches have been
proposed for various utterance-level dialogue understanding tasks. In this paper, we explore and
quantify the role of context for different aspects of a dialogue, namely emotion, dialogue act, and
intent identification, using state-of-the-art dialogue understanding methods as baselines.
Specifically, we employ various perturbations to distort the context of a given utterance and study
its impact on the different tasks and baselines. This provides us with insights into the fundamental
context factors that have immediate implications on different aspects of a dialogue. Such insights
may inspire more effective dialogue understanding models and provide support for future text
generation approaches.
Improving distantly supervised relation extraction with self-ensemble noise filtering
Distantly supervised models are very popular for relation extraction since we can obtain a large
amount of training data using the distant supervision method without human annotation. In distant
supervision, a sentence is considered as a source of a tuple if the sentence contains both entities
of the tuple. However, this condition is too permissive and does not guarantee the presence of
relevant relation-specific information in the sentence. As such, distantly supervised training data
contains much noise which adversely affects the performance of the models. In this paper, we propose
a self-ensemble filtering mechanism to filter out the noisy samples during the training process. We
evaluate our proposed framework on the New York Times dataset which is obtained via distant
supervision. Our experiments with multiple state-of-the-art neural relation extraction models show
that our proposed filtering mechanism improves the robustness of the models and increases their F1
scores.
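As an illustrative aside, the filtering idea can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the snapshot ensemble, the agreement threshold, and the names (`self_ensemble_filter`, `min_agreement`) are all hypothetical simplifications of the self-ensemble mechanism.

```python
import numpy as np

def self_ensemble_filter(pred_history, distant_labels, min_agreement=0.5):
    """Filter distantly supervised samples with an ensemble of model snapshots.

    pred_history: (num_snapshots, num_samples) array of relation ids predicted
                  by successive training checkpoints (the "self-ensemble").
    distant_labels: (num_samples,) relation ids assigned by distant supervision.
    Keeps a sample only if at least `min_agreement` of the snapshots agree
    with its distant label; the rest are treated as noise and dropped.
    """
    agreement = (pred_history == distant_labels[None, :]).mean(axis=0)
    return agreement >= min_agreement

# Toy run: 3 snapshots, 4 samples; the last sample's distant label is never
# supported by a majority of snapshots, so it is filtered out.
preds = np.array([[1, 0, 2, 2],
                  [1, 1, 2, 0],
                  [1, 0, 2, 1]])
labels = np.array([1, 0, 2, 2])
mask = self_ensemble_filter(preds, labels)  # → [True, True, True, False]
```

In the paper's setting the kept set would be recomputed as training progresses, so early noisy gradients do not permanently shape the model.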
Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis
★
In multimodal sentiment analysis (MSA), the performance of a model highly depends on the quality of
synthesized embeddings. These embeddings are generated from the upstream process called multimodal
fusion, which aims to extract and combine the input unimodal raw data to produce a richer multimodal
representation. Previous work either back-propagates the task loss or manipulates the geometric
property of feature spaces to produce favorable fusion results, which neglects the preservation of
critical task-related information that flows from input to the fusion results. In this work, we
propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual
Information (MI) in unimodal input pairs (inter-modality) and between multimodal fusion result and
unimodal input in order to maintain task-related information through multimodal fusion. The
framework is jointly trained with the main task (MSA) to improve the performance of the downstream
MSA task. To address the intractability of MI bounds, we further formulate a set of
computationally simple parametric and non-parametric methods to approximate their true value.
Experimental results on two widely used datasets demonstrate the efficacy of our approach.
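The MI-maximization objective can be illustrated with a generic contrastive (InfoNCE-style) lower bound. This is a sketch of the general technique only, not MMIM's actual parametric and non-parametric estimators, and the variable names are illustrative.

```python
import numpy as np

def infonce_lower_bound(z_a, z_b, temperature=0.1):
    """Contrastive (InfoNCE-style) lower bound on mutual information.

    z_a, z_b: (batch, dim) paired representations, e.g. a unimodal embedding
    and the fusion result for the same sample. Matching pairs sit on the
    diagonal of the similarity matrix; the other rows in the batch act as
    negatives. Minimizing the returned cross-entropy maximizes the MI bound.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # NLL of the matching pairs
```

Applied hierarchically, one such term per modality pair and one between each unimodal input and the fusion result would mirror the two levels described above.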
Information fusion for affective computing and sentiment analysis
Emotions are intrinsically part of our mental activity and play a key role in communication and
decision-making processes [1,2]. Emotion, cognition, and action interact in feedback loops and
emotion can be viewed in a structural model tied to adaptation. Besides being important for the
advancement of AI, detecting and interpreting emotional information is key in multiple areas of
computer science, e.g., human-agent, -computer, and -robot interaction, but also e-learning,
e-health, domotics, automotive, security, user profiling and personalization. In recent years,
emotion and sentiment analysis has become increasingly popular also for processing social media data
on social networks, online communities, blogs, Wikis, microblogging platforms, and other online
collaborative media [3,4]. The distillation of knowledge from such a large amount of unstructured
information, however, is an extremely difficult task, as the contents of today’s Web are perfectly
suitable for human consumption, but remain hardly accessible to machines. The opportunity to capture
the opinions of the general public about social events, political movements, company strategies,
marketing campaigns, and product preferences has raised growing interest both within the scientific
community, leading to many exciting open challenges, as well as in the business world, due to the
remarkable benefits to be had from marketing [5] and financial market prediction [6]. Most of
existing approaches to affective computing and sentiment analysis are still based on the syntactic
representation of text, a method that relies mainly on word co-occurrence frequencies. Such
algorithms are limited by the fact that they can only process information they can ‘see’. As human
text processors, we do not have such limitations as every word we see activates a cascade of
semantically related concepts, relevant episodes, emotions, and sensory experiences [7], all of
which enable the completion of other complex NLP tasks such as subjectivity detection [8], anaphora
resolution [9], personality recognition [10], and more. Information fusion can help mimic the way
humans process and analyze text and, hence, overcome the limitations of standard approaches to
affective computing and sentiment analysis.
Investigating gender bias in BERT
★
In this work, we analyze the gender bias induced by BERT in downstream tasks. We also propose
solutions to reduce gender bias. Contextual language models (CLMs) have pushed the NLP benchmarks to
a new height. It has become a new norm to utilize CLM-provided word embeddings in downstream tasks
such as text classification. However, unless addressed, CLMs are prone to learning intrinsic gender
bias from the dataset. As a result, predictions of downstream NLP models can vary noticeably when
gender words are varied, such as replacing “he” with “she”, or even when gender-neutral words are
changed. In this paper, we focus our analysis on a popular CLM, i.e., BERT. We analyze the gender
bias it induces in five downstream tasks related to emotion and sentiment intensity prediction. For
each task, we train a simple regressor utilizing BERT’s word embeddings. We then evaluate the gender
bias in the regressors using an equity evaluation corpus. Ideally, and by design, the models should
discard gender-informative features from the input. However, the results show a significant
dependence of the system’s predictions on gender-particular words and phrases. We claim that such
biases can be reduced by removing gender-specific features from the word embeddings. Hence, for each
layer in BERT, we identify directions that primarily encode gender information. The space formed by
such directions is referred to as the gender subspace in the semantic space of word embeddings. We
propose an algorithm that finds fine-grained gender directions, i.e., one primary direction for each
BERT layer. This obviates the need to realize the gender subspace in multiple dimensions and
prevents other crucial information from being omitted. Experiments show that removing embedding
components along the gender directions achieves great success in reducing BERT-induced bias in the
downstream tasks. The investigation reveals the significant gender bias that a contextualized
language model (i.e., BERT) induces in downstream tasks, and the proposed solution seems promising
in reducing such biases.
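The direction-removal step lends itself to a compact sketch. Assuming per-layer word vectors and a handful of definitional pairs, a gender direction can be estimated as the leading singular vector of the stacked pair differences and then projected out. This is a simplified stand-in for the paper's fine-grained algorithm; the inputs and names here are hypothetical.

```python
import numpy as np

def gender_direction(embed, pairs):
    """Estimate a single gender direction for one BERT layer.

    embed: dict mapping word -> layer embedding (hypothetical inputs).
    pairs: definitional (male_word, female_word) tuples such as ("he", "she").
    Returns the leading right-singular vector of the stacked pair differences,
    i.e. the one axis the definitional pairs share.
    """
    diffs = np.stack([embed[m] - embed[f] for m, f in pairs])
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # unit-norm principal direction

def remove_direction(vec, direction):
    """Debias a vector: subtract its projection onto the gender direction."""
    d = direction / np.linalg.norm(direction)
    return vec - (vec @ d) * d
```

After `remove_direction`, the embedding is orthogonal to the estimated direction, so a downstream regressor can no longer read gender off that axis; repeating this per layer mirrors the one-direction-per-layer design described above.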
M2H2: A multimodal multiparty hindi dataset for humor recognition in conversations
Humor recognition in conversations is a challenging task that has recently gained popularity due to
its importance in dialogue understanding, including in multimodal settings (i.e., text, acoustics,
and visual). The few existing datasets for humor are mostly in English. However, due to the
tremendous growth in multilingual content, there is a great demand to build models and systems that
support multilingual information access. To this end, we propose a dataset for Multimodal Multiparty
Hindi Humor (M2H2) recognition in conversations containing 6,191 utterances from 13 episodes of a
very popular TV series “Shrimaan Shrimati Phir Se”. Each utterance is annotated with humor/non-humor
labels and encompasses acoustic, visual, and textual modalities. We propose several strong
multimodal baselines and show the importance of contextual and multimodal information for humor
recognition in conversations. The empirical results on M2H2 dataset demonstrate that multimodal
information complements unimodal information for humor recognition. The dataset and the baselines
are available at http://www.iitp.ac.in/~ai-nlp-ml/resources.html and
https://github.com/declare-lab/M2H2-dataset.
More identifiable yet equally performant transformers for text classification
Interpretability is an important aspect of the trustworthiness of a model’s predictions.
Transformer’s predictions are widely explained by the attention weights, i.e., a probability
distribution generated at its self-attention unit (head). Current empirical studies provide shreds
of evidence that attention weights are not explanations by proving that they are not unique. A
recent study showed theoretical justifications to this observation by proving the
non-identifiability of attention weights. For a given input to a head and its output, if the
attention weights generated in it are unique, we call the weights identifiable. In this work, we
provide deeper theoretical analysis and empirical observations on the identifiability of attention
weights. Though ignored in previous works, we find that attention weights are more identifiable than
currently perceived, by uncovering the hidden role of the key vector. However, the weights are still
prone to being non-unique, which makes them unfit for interpretation. To tackle this issue, we
provide a variant of the encoder layer that decouples the relationship between the key and value
vectors and provides identifiable weights up to the desired length of the input. We prove the applicability
of such variations by providing empirical justifications on varied text classification tasks. The
implementations are available at https://github.com/declare-lab/identifiable-transformers.
Persuasive dialogue understanding: The baselines and negative results
Persuasion aims at forming one's opinion and action via a series of persuasive messages containing
persuader's strategies. Due to its potential application in persuasive dialogue systems, the task of
persuasive strategy recognition has gained much attention lately. Previous methods on user intent
recognition in dialogue systems adopt recurrent neural network (RNN) or convolutional neural network
(CNN) to model context in conversational history, neglecting the tactic history and intra-speaker
relation. In this paper, we demonstrate the limitations of a Transformer-based approach coupled with
Conditional Random Field (CRF) for the task of persuasive strategy recognition. In this model, we
leverage inter- and intra-speaker contextual semantic features, as well as label dependencies to
improve the recognition. Despite extensive hyper-parameter optimizations, this architecture fails to
outperform the baseline methods. We observe two negative results. Firstly, CRF cannot capture
persuasive label dependencies, possibly as strategies in persuasive dialogues do not follow any
strict grammar or rules as the cases in Named Entity Recognition (NER) or part-of-speech (POS)
tagging. Secondly, the Transformer encoder trained from scratch is less capable of capturing
sequential information in persuasive dialogues than Long Short-Term Memory (LSTM). We attribute this
to the reason that the vanilla Transformer encoder does not efficiently consider relative position
information of sequence elements.
Phonetic-enriched text representation for Chinese sentiment analysis with reinforcement learning
★
The Chinese pronunciation system offers two characteristics that distinguish it from other
languages: deep phonemic orthography and intonation variations. We are the first to argue that these
two important properties can play a major role in Chinese sentiment analysis. Particularly, we
propose two effective features to encode phonetic information. Next, we develop a Disambiguate
Intonation for Sentiment Analysis (DISA) network based on reinforcement learning, which
disambiguates the intonation of each Chinese character (pinyin). Thus, a precise phonetic
representation of Chinese is learned. Furthermore, we also fuse phonetic features with textual and
visual features in order to mimic the way humans read and understand Chinese text. Experimental
results on five different Chinese sentiment analysis datasets show that the inclusion of phonetic
features significantly and consistently improves the performance of textual and visual
representations and outshines the state-of-the-art Chinese character level representations.
Recognizing emotion cause in conversations
★
We address the problem of recognizing emotion cause in conversations, define two novel sub-tasks of
this problem, and provide a corresponding dialogue-level dataset, along with strong
transformer-based baselines. The dataset is available at https://github.com/declare-lab/RECCON.
Recognizing the cause behind emotions in text is a fundamental yet under-explored area of research
in NLP. Advances in this area hold the potential to improve interpretability and performance in
affect-based models. Identifying emotion causes at the utterance level in conversations is
particularly challenging due to the intermingling dynamics among the interlocutors. We introduce the
task of Recognizing Emotion Cause in CONversations with an accompanying dataset named RECCON,
containing over 1,000 dialogues and 10,000 utterance cause/effect pairs. Furthermore, we define
different cause types based on the source of the causes, and establish strong Transformer-based
baselines to address two different sub-tasks on this dataset. Our transformer-based baselines, which
leverage contextual pre-trained embeddings, such as RoBERTa, outperform the state-of-the-art emotion
cause extraction approaches on our dataset. We introduce a new task highly relevant for
(explainable) emotion-aware artificial intelligence: recognizing emotion cause in conversations,
provide a new highly challenging publicly available dialogue-level dataset for this task, and give
strong baseline results on this dataset.
Retrieving and reading: A comprehensive survey on open-domain question answering
★
Open-domain Question Answering (OpenQA) is an important task in Natural Language
Processing (NLP), which aims to answer a question in the form of natural language based on
large-scale unstructured documents. Recently, there has been a surge in the amount of research
literature on OpenQA, particularly on techniques that integrate with neural Machine Reading
Comprehension (MRC). While these research works have advanced performance to new heights on
benchmark datasets, they have been rarely covered in existing surveys on QA systems. In this work,
we review the latest research trends in OpenQA, with particular attention to systems that
incorporate neural MRC techniques. Specifically, we begin with revisiting the origin and development
of OpenQA systems. We then introduce the modern OpenQA architecture named "Retriever-Reader" and
analyze the various systems that follow this architecture as well as the specific techniques adopted
in each of the components. We then discuss key challenges to developing OpenQA systems and offer an
analysis of benchmarks that are commonly used. We hope our work would enable researchers to be
informed of the recent advancement and also the open challenges in OpenQA research, so as to
stimulate further progress in this field.
STaCK: Sentence ordering with temporal commonsense knowledge
Sentence order prediction is the task of finding the correct order of sentences in a randomly
ordered document. Correctly ordering the sentences requires an understanding of coherence with
respect to the chronological sequence of events described in the text. Document-level contextual
understanding and commonsense knowledge centered around these events are often essential in
uncovering this coherence and predicting the exact chronological order. In this paper, we introduce
STaCK — a framework based on graph neural networks and temporal commonsense knowledge to model
global information and predict the relative order of sentences. Our graph network accumulates
temporal evidence using knowledge of ‘past’ and ‘future’ and formulates sentence ordering as a
constrained edge classification problem. We report results on five different datasets, and
empirically show that the proposed method is naturally suitable for order prediction. The
implementation of this work is available at: https://github.com/declare-lab/sentence-ordering.
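For intuition, the "relative order from pairwise evidence" idea can be approximated without the graph network: given pairwise precedes-scores (here a hypothetical classifier output), rank sentences by how many others they are predicted to precede. This is a simplified stand-in, not STaCK's constrained edge classification over the knowledge graph.

```python
def order_sentences(pairwise_before, n):
    """Recover a sentence order from pairwise 'precedes' scores.

    pairwise_before[i][j] is the score that sentence i comes before
    sentence j (assumed to come from a trained pairwise classifier).
    Each sentence is ranked by the number of others it is predicted
    to precede, highest first.
    """
    wins = [sum(pairwise_before[i][j] > 0.5 for j in range(n) if j != i)
            for i in range(n)]
    return sorted(range(n), key=lambda i: -wins[i])

# Toy scores implying the order: sentence 2, then 0, then 1.
scores = [[0.0, 0.7, 0.1],
          [0.3, 0.0, 0.2],
          [0.9, 0.8, 0.0]]
order = order_sentences(scores, 3)  # → [2, 0, 1]
```

The win-counting rule is where temporal commonsense evidence (knowledge of 'past' and 'future') would enter in the full model, by shaping the pairwise scores themselves.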
2020
Anaphora and coreference resolution: A review
★
Entity resolution aims at resolving repeated references to an entity in a document and forms a core
component of natural language processing (NLP) research. This field possesses immense potential to
improve the performance of other NLP fields like machine translation, sentiment analysis, paraphrase
detection, summarization, etc. The area of entity resolution in NLP has seen proliferation of
research in two separate sub-areas namely: anaphora resolution and coreference resolution. Through
this review article, we aim at clarifying the scope of these two tasks in entity resolution. We also
carry out a detailed analysis of the datasets, evaluation metrics and research methods that have
been adopted to tackle this NLP problem. This survey is motivated with the aim of providing the
reader with a clear understanding of what constitutes this NLP problem and the issues that require
attention.
Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research
★
Sentiment analysis as a field has come a long way since it was first introduced as a task nearly 20
years ago. It has widespread commercial applications in various domains like marketing, risk
management, market research, and politics, to name a few. Given its saturation in specific subtasks
— such as sentiment polarity classification — and datasets, there is an underlying perception that
this field has reached its maturity. In this article, we discuss this perception by pointing out the
shortcomings and under-explored, yet key aspects of this field necessary to attain true sentiment
understanding. We analyze the significant leaps responsible for its current relevance. Further, we
attempt to chart a possible course for this field that covers many overlooked and unanswered
questions.
CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French
★
Modeling multimodal language is a core research area in natural language processing. While
languages such as English have relatively large multimodal language resources, other widely spoken
languages across the globe have few or no large-scale datasets in this area. This disproportionately
affects native speakers of languages other than English. As a step towards building more equitable
and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for
Spanish, Portuguese, German and French. The proposed dataset, called CMU-MOSEAS (CMU Multimodal
Opinion Sentiment, Emotions and Attributes), is the largest of its kind with 40,000 total labelled
sentences. It covers a diverse set of topics and speakers, and carries supervision of 20 labels
including sentiment (and subjectivity), emotions, and attributes. Our evaluations on a
state-of-the-art multimodal model demonstrate that CMU-MOSEAS enables further research for
multilingual studies in multimodal language.
Conversational Transfer Learning for Emotion Recognition
★
Recognizing emotions in conversations is a challenging task due to the presence of contextual
dependencies governed by self- and inter-personal influences. Recent approaches have focused on
modeling these dependencies primarily via supervised learning. However, purely supervised strategies
demand large amounts of annotated data, which is lacking in most of the available corpora in this
task. To tackle this challenge, we look at transfer learning approaches as a viable alternative.
Given the large amount of available conversational data, we investigate whether generative
conversational models can be leveraged to transfer affective knowledge for detecting emotions in
context. We propose an approach, TL-ERC, where we pre-train a hierarchical dialogue model on
multi-turn conversations (source) and then transfer its parameters to a conversational emotion
classifier (target). In addition to the popular practice of using pre-trained sentence encoders, our
approach also incorporates recurrent parameters that model inter-sentential context across the whole
conversation. Based on this idea, we perform several experiments across multiple datasets and find
improvement in performance and robustness against limited training data. TL-ERC also achieves better
validation performances in significantly fewer epochs. Overall, we infer that knowledge acquired
from dialogue generators can indeed help recognize emotions in conversations.
COSMIC: Commonsense knowledge for emotion identification in conversations
★
In this paper, we address the task of utterance level emotion recognition in conversations using
commonsense knowledge. We propose COSMIC, a new framework that incorporates different elements of
commonsense such as mental states, events, and causal relations, and build upon them to learn
interactions between interlocutors participating in a conversation. Current state-of-the-art methods
often encounter difficulties in context propagation, emotion shift detection, and differentiating
between related emotion classes. By learning distinct commonsense representations, COSMIC addresses
these challenges and achieves new state-of-the-art results for emotion recognition on four different
benchmark conversational datasets. Our code is available at
https://github.com/declare-lab/conv-emotion.
Dialogue systems with audio context
Research on building dialogue systems that converse with humans naturally has recently
attracted a lot of attention. Most work in this area assumes text-based conversation, where the user
message is modeled as a sequence of words in a vocabulary. Real-world human conversation, in
contrast, involves other modalities, such as voice, facial expression and body language, which can
influence the conversation significantly in certain scenarios. In this work, we explore the impact
of incorporating the audio features of the user message into generative dialogue systems.
Specifically, we first design an auxiliary response retrieval task for audio representation
learning. Then, we use word-level modality fusion to incorporate the audio features as additional
context to our main generative model. Experiments show that our audio-augmented model outperforms
the audio-free counterpart on perplexity, response diversity and human evaluation.
KinGDOM: Knowledge-Guided DOMain adaptation for sentiment analysis
★
Cross-domain sentiment analysis has received significant attention in recent years, prompted by the
need to combat the domain gap between different applications that make use of sentiment analysis. In
this paper, we take a novel perspective on this task by exploring the role of external commonsense
knowledge. We introduce a new framework, KinGDOM, which utilizes the ConceptNet knowledge graph to
enrich the semantics of a document by providing both domain-specific and domain-general background
concepts. These concepts are learned by training a graph convolutional autoencoder that leverages
inter-domain concepts in a domain-invariant manner. Conditioning a popular domain-adversarial
baseline method with these learned concepts helps improve its performance over state-of-the-art
approaches, demonstrating the efficacy of our proposed framework.
MIME: MIMicking emotions for empathetic response generation
★
Current approaches to empathetic response generation view the set of emotions expressed in the
input text as a flat structure, where all the emotions are treated uniformly. We argue that
empathetic responses often mimic the emotion of the user to a varying degree, depending on its
positivity or negativity and content. We show that the consideration of these polarity-based emotion
clusters and emotional mimicry results in improved empathy and contextual relevance of the response
as compared to the state-of-the-art. Also, we introduce stochasticity into the emotion mixture that
yields emotionally more varied empathetic responses than the previous work. We demonstrate the
importance of these factors to empathetic response generation using both automatic- and human-based
evaluations. The implementation of MIME is publicly available at
https://github.com/declare-lab/MIME.
MISA: Modality-invariant and-specific representations for multimodal sentiment analysis
★
Human emotion judgments usually receive information from multiple modalities such as language,
audio, as well as facial expressions and gestures. Because different modalities are represented
differently, multimodal data exhibit redundancy and complementarity, so a reasonable multimodal
fusion approach is essential to improve the accuracy of sentiment analysis. Inspired by the
Crossmodal Transformer for multimodal data fusion in the MulT (Multimodal Transformer) model, this
paper adds the Crossmodal transformer for modal enhancement of different modal data in the fusion
part of the MISA (Modality-Invariant and -Specific Representations for Multimodal Sentiment
Analysis) model, and proposes three MISA-CT models. Tested on two publicly available multimodal
sentiment analysis datasets MOSI and MOSEI, the experimental results of the models outperformed the
original MISA model.
MTAG: Modal-temporal attention graph for unaligned human multimodal language sequences
★
Human communication is multimodal in nature; it is through multiple modalities such as language,
voice, and facial expressions, that opinions and emotions are expressed. Data in this domain
exhibits complex multi-relational and temporal interactions. Learning from this data is a
fundamentally challenging research problem. In this paper, we propose Modal-Temporal Attention Graph
(MTAG). MTAG is an interpretable graph-based neural model that provides a suitable framework for
analyzing multimodal sequential data. We first introduce a procedure to convert unaligned multimodal
sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions
across modalities and through time. Then, a novel graph fusion operation, called MTAG fusion, along
with a dynamic pruning and read-out technique, is designed to efficiently process this
modal-temporal graph and capture various interactions. By learning to focus only on the important
interactions within the graph, MTAG achieves state-of-the-art performance on multimodal sentiment
analysis and emotion recognition benchmarks, while utilizing significantly fewer model parameters.
SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis
★
Deep learning has unlocked new paths towards the emulation of the peculiarly-human capability of
learning from examples. While this kind of bottom-up learning works well for tasks such as image
classification or object detection, it is not as effective when it comes to natural language
processing. Communication is much more than learning a sequence of letters and words: it requires a
basic understanding of the world and social norms, cultural awareness, commonsense knowledge, etc.;
all things that we mostly learn in a top-down manner. In this work, we integrate top-down and
bottom-up learning via an ensemble of symbolic and subsymbolic AI tools, which we apply to the
interesting problem of polarity detection from text. In particular, we integrate logical reasoning
within deep learning architectures to build a new version of SenticNet, a commonsense knowledge base
for sentiment analysis.
Social media marketing and financial forecasting
The last decade has witnessed a huge development in social media interactions. Today, social media
is becoming ubiquitous in two senses: (1) it crosses boundaries and expands to more nations, and (2)
it deeply intervenes in our personal life and economic activities. This outburst of social media
data has led to an extensive amount of research for analyzing and extracting useful knowledge and
workable patterns from social media data (Hussain and Cambria, 2018). Among those, however, the
academic interest in social media marketing and financial forecasting has only increased in recent
years (Xing et al., 2018b). For example, only one special issue (Hajli and Laroche, 2019) and two
workshops (Hahn et al., 2018; Chen et al., 2019) have been organized on these two topics,
respectively. Social media marketing aims to communicate with potential customers to build brand
loyalty, attract attention, and increase sales (Cambria et al., 2012). As marketing activities
migrate to an online interactive environment, companies can listen in on customer feedback as well
as actively influence customer behavior (Constantinides, 2009). Social media-based financial
forecasting involves applications such as asset allocation strategies, pricing and valuing
financial assets, and stock market prediction (Xing et al., 2018a; Malandri et al., 2018). These
applications are not static procedures; on the contrary, they continue to collect information and
accordingly revise the output. Both social media marketing and financial forecasting require
understanding and leveraging the characteristics of social media data. We identify three major
characteristics of social media data. First, the volume of social media data is huge. Moreover,
marketing and financial forecasting tasks often require real-time analysis of those data. Hence the
processing methods have to be efficient. Second, social media data are becoming multi-modal. Even
pure-text data often comes with extra dimensions, such as timestamps and location tags. Making use
of such information is what social media research offers beyond NLP research. Third, before the
surge of social media data, there were existing information sources to support marketing and
financial forecasting. Therefore, methods often have to fuse knowledge from social media data and
other sources. In line with the above discussion, this special issue focuses on the presentation of
marketing and financial forecasting research that tackles the three characteristics of social media
data.
Utterance-level dialogue understanding: An empirical study
The recent abundance of conversational data on the Web and elsewhere calls for effective
NLP systems for dialog understanding. Complete utterance-level understanding often requires context
understanding, defined by nearby utterances. In recent years, a number of approaches have been
proposed for various utterance-level dialogue understanding tasks. Most of these approaches account
for the context for effective understanding. In this paper, we explore and quantify the role of
context for different aspects of a dialogue, namely emotion, intent, and dialogue act
identification, using state-of-the-art dialog understanding methods as baselines. Specifically, we
employ various perturbations to distort the context of a given utterance and study its impact on the
different tasks and baselines. This provides us with insights into the fundamental contextual
controlling factors of different aspects of a dialogue. Such insights can inspire more effective
dialogue understanding models, and provide support for future text generation approaches. The
implementation pertaining to this work is available at this https URL.
2019
An Attention-Based Model for Learning Dynamic Interaction Networks
In the physical world, complex systems are generally created as the composition of multiple
primitive components that interact with each other rather than a single monolithic structure.
Recently, spatio-temporal graphs received a reasonable amount of attention from the research
community since they emerged as a natural representational tool able to capture the interactive and
interrelated structure of a complex problem. To better understand the nature of complex systems,
there is the need to define models that can easily explain the learned causal relationship. To this
end, we propose an attentive model able to learn and project the relational structure into a
fixed-size embedding. Such a representation naturally captures the dynamic influence that each
neighbor exerts over a given vertex, providing a valuable description of the problem setting. The
proposed architecture has been extensively evaluated against strong baselines on toy as well as
real-world tasks, such as prediction of household energy load and traffic congestion.
DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation
★
Emotion recognition in conversation (ERC) has lately received much attention from researchers due
to its potential widespread applications in diverse areas, such as health-care, education, and human
resources. In this paper, we present Dialogue Graph Convolutional Network (DialogueGCN), a graph
neural network based approach to ERC. We leverage self and inter-speaker dependency of the
interlocutors to model conversational context for emotion recognition. Through the graph network,
DialogueGCN addresses context propagation issues present in the current RNN-based methods and
empirically outperforms prior state-of-the-art on several benchmark datasets.
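As a hedged illustration (not the authors' code), the speaker-dependency modeling described above can be pictured as building a typed graph over utterances: each utterance is a node, and edges within a context window are labeled by whether they connect the same speaker (self-dependency) or different speakers (inter-speaker dependency) and by temporal direction. The function name and edge-type labels here are assumptions for the sketch.

```python
# Minimal sketch of a DialogueGCN-style conversation graph: nodes are
# utterances, edges within a context window carry a relation type built
# from speaker identity (self vs. inter) and temporal direction.
def build_dialogue_graph(speakers, window=2):
    """speakers: list of speaker ids, one per utterance, in order.
    Returns a list of (source, target, relation) edges."""
    edges = []
    n = len(speakers)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if i == j:
                continue
            dependency = "self" if speakers[i] == speakers[j] else "inter"
            direction = "past" if j < i else "future"
            edges.append((i, j, f"{dependency}-{direction}"))
    return edges

# A three-utterance dialogue between speakers A and B:
edges = build_dialogue_graph(["A", "B", "A"])
```

A relational GCN would then learn a separate transformation per edge type, which is how self- and inter-speaker context can propagate differently.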
DialogueRNN: An attentive RNN for emotion detection in conversations
★
Emotion detection in conversations is a necessary step for a number of applications, including
opinion mining over chat history, social media threads, debates, argumentation mining, understanding
consumer feedback in live conversations, and so on. Current systems do not treat the parties in
the conversation individually by adapting to the speaker of each utterance. In this paper, we
describe a new method based on recurrent neural networks that keeps track of the individual party
states throughout the conversation and uses this information for emotion classification. Our model
outperforms the state-of-the-art by a significant margin on two different datasets.
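The core idea of keeping track of individual party states can be sketched as follows; this is an illustrative simplification (the paper uses recurrent networks, whereas the `update` function and scalar features here are stand-ins chosen for the sketch).

```python
# Sketch of DialogueRNN's party-state idea: keep one running state per
# conversation party and update only the current speaker's state at
# each utterance, so each party's history is modeled separately.
def track_party_states(utterances, update, init_state=0.0):
    """utterances: list of (speaker, features); update: state-update fn."""
    states = {}    # one persistent state per party
    history = []   # the active speaker's state after each utterance
    for speaker, feats in utterances:
        prev = states.get(speaker, init_state)
        states[speaker] = update(prev, feats)
        history.append(states[speaker])
    return states, history

# Toy update rule: exponential moving average of a scalar feature.
ema = lambda s, x: 0.5 * s + 0.5 * x
states, hist = track_party_states([("A", 2.0), ("B", 4.0), ("A", 6.0)], ema)
```

Note that speaker B's state is untouched while A speaks, which is exactly the per-party separation the abstract describes.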
Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances
★
Emotion is intrinsic to humans and consequently, emotion understanding is a key part of human-like
artificial intelligence (AI). Emotion recognition in conversation (ERC) is becoming increasingly
popular as a new research frontier in natural language processing (NLP) due to its ability to mine
opinions from the plethora of publicly available conversational data on platforms such as Facebook,
YouTube, Reddit, Twitter, and others. Moreover, it has potential applications in health-care systems
(as a tool for psychological analysis), education (understanding student frustration), and more. In
addition, ERC is also extremely important for generating emotion-aware dialogues that require an
understanding of the user’s emotions. Catering to these needs calls for effective and scalable
conversational emotion-recognition algorithms. However, it is a difficult problem to solve because
of several research challenges. In this paper, we discuss these challenges and shed light on recent
research in this field. We also describe the drawbacks of these approaches and discuss the reasons
why they fail to successfully overcome the research challenges in ERC.
Factorized multimodal transformer for multimodal sequential learning
★
The complex world around us is inherently multimodal and sequential (continuous).
Information is scattered across different modalities and requires multiple continuous sensors to be
captured. As machine learning leaps towards better generalization to real world, multimodal
sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed
spatio-temporal dynamics within and across modalities is the biggest challenge in this research
area. In this paper, we present a new transformer model, called the Factorized Multimodal
Transformer (FMT) for multimodal sequential learning. FMT inherently models the intramodal and
intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized
manner. The proposed factorization allows for increasing the number of self-attentions to better
model the multimodal phenomena at hand, without encountering difficulties during training (e.g.,
overfitting) even in relatively low-resource setups. All the attention mechanisms within FMT have a
full time-domain receptive field which allows them to asynchronously capture long-range multimodal
dynamics. In our experiments we focus on datasets that contain the three commonly studied modalities
of language, vision, and acoustics. We perform a wide range of experiments spanning 3
well-studied datasets and 21 distinct labels. FMT shows superior performance over previously
proposed models, setting new state of the art in the studied datasets.
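One way to see why the factorization increases the number of self-attentions, as the abstract notes, is to count the modality combinations involved: with three modalities there is one attention stream per non-empty modality subset. This enumeration is an illustrative reading of the factorization, not the paper's exact architecture.

```python
# Illustrative count of attention streams under subset factorization:
# every non-empty subset of {language, vision, acoustic} gets its own
# intramodal (size 1) or intermodal (size >= 2) attention stream.
from itertools import combinations

def modality_streams(modalities):
    streams = []
    for r in range(1, len(modalities) + 1):
        streams.extend(combinations(modalities, r))
    return streams

streams = modality_streams(["language", "vision", "acoustic"])
# 3 unimodal + 3 bimodal + 1 trimodal = 7 streams
```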
MELD: A multimodal multi-party dataset for emotion recognition in conversations
★
Emotion recognition in conversations is a challenging task that has recently gained popularity due
to its potential applications. Until now, however, a large-scale multimodal multi-party emotional
conversational database containing more than two speakers per dialogue was missing. Thus, we propose
the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines. MELD
contains about 13,000 utterances from 1,433 dialogues from the TV-series Friends. Each utterance is
annotated with emotion and sentiment labels, and encompasses audio, visual and textual modalities.
We propose several strong multimodal baselines and show the importance of contextual and multimodal
information for emotion recognition in conversations. The full dataset is available at
http://affective-meld.github.io.
Multi-task Learning for Multi-Modal Emotion Recognition and Sentiment Analysis
★
Related tasks often have inter-dependence on each other and perform better when solved in a joint
framework. In this paper, we present a deep multi-task learning framework that jointly performs
sentiment and emotion analysis both. The multi-modal inputs (i.e. text, acoustic and visual frames)
of a video convey diverse and distinctive information, and usually do not have equal contribution in
the decision making. We propose a context-level inter-modal attention framework for simultaneously
predicting the sentiment and expressed emotions of an utterance. We evaluate our proposed approach
on the CMU-MOSEI dataset for multi-modal sentiment and emotion analysis. Evaluation results suggest
that the multi-task learning framework offers an improvement over the single-task framework. The proposed
approach reports new state-of-the-art performance for both sentiment analysis and emotion analysis.
Sentiment and Sarcasm Classification with Multitask Learning
★
Sentiment classification and sarcasm detection are both important natural language processing
tasks. Sentiment is often coupled with sarcasm where intense emotion is expressed. Nevertheless,
most literature considers them as two separate tasks. We argue that knowledge in sarcasm detection
can also be beneficial to sentiment classification and vice versa. We show that these two tasks are
correlated, and present a multitask learning-based framework using a deep neural network that models
this correlation to improve the performance of both tasks in a multitask learning setting. Our
method outperforms the state of the art by 3–4% in the benchmark dataset.
The Nitty-GRITties of Success: Computational Analysis of Grit From Language
Grit is a prominent personality trait that measures an individual's passion and perseverance for
long-term goals. Grit entails that zeal and persistence of motive play a key role in determining an
individual's success in the long run, as opposed to natural talent. But how does one identify and
distinguish between gritty and non-gritty individuals? Do they use language differently? In this
paper, we seek to answer these questions using a social media setting. We build a new crowd-sourced
Twitter corpus that contains the posts of 464 users along with their grit scores, and explore how
grit correlates with other major behavioural and personality traits. We then train machine learning
classification models to predict the grit level of an individual using his/her Twitter posts, and
show that language can be effectively used to infer this personality trait.
Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper)
★
Sarcasm is often expressed through several verbal and non-verbal cues, e.g., a change of tone,
overemphasis in a word, a drawn-out syllable, or a straight-looking face. Most of the recent work in
sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating
multimodal cues can improve the automatic classification of sarcasm. As a first step towards
enabling the development of multimodal approaches for sarcasm detection, we propose a new sarcasm
dataset, Multimodal Sarcasm Detection Dataset (MUStARD), compiled from popular TV shows. MUStARD
consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by
its context of historical utterances in the dialogue, which provides additional information on the
scenario where the utterance occurs. Our initial results show that the use of multimodal information
can reduce the relative error rate of sarcasm detection by up to 12.9% in F-score when compared to
the use of individual modalities. The full dataset is publicly available for use at
https://github.com/soujanyaporia/MUStARD.
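For readers unfamiliar with the "relative error rate" phrasing used above: with error defined as one minus the F-score, the relative reduction is the fraction of the baseline's remaining error that the multimodal model removes. The F-score values below are hypothetical numbers chosen only to make the arithmetic concrete; they are not from the paper.

```python
# Worked example of relative error-rate reduction in F-score terms:
# error = 1 - F, and the reduction is the share of the baseline's
# error eliminated by the new model.
def relative_error_reduction(f_base, f_new):
    return (f_new - f_base) / (1.0 - f_base)

# Hypothetical example: moving from F = 0.700 to F = 0.739 removes
# about 13% of the baseline's error.
r = relative_error_reduction(0.700, 0.739)
```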