Diffusion-based large-scale text-to-image (T2I) models have recently reshaped how visual content is created, making it simple to produce engaging, expressive, and human-centered imagery. An intriguing application of these models is generating varied scenes for a specific identity from natural language descriptions, given a photo of a person's face from daily life (a family member, a friend, etc.). This identity re-contextualization task, which differs from the standard T2I task illustrated in Fig. 1, requires the model to preserve the identity of the input face (i.e., ID preservation) while adhering to the textual prompt.
A workable approach is to personalize a pre-trained T2I model for each face identity: the model learns to associate a specific word with the identity by optimizing its word embedding or by fine-tuning the model parameters. Because of this per-identity optimization, such optimization-based approaches are inefficient. To avoid the time-consuming per-identity optimization, various optimization-free methods instead map the image features produced by a pre-trained image encoder (usually CLIP) directly into a word embedding, but this compromises ID preservation. These techniques also risk impairing the original T2I model's editing ability, since they either fine-tune the parameters of the pre-trained T2I model or alter its original structure to inject extra grid image features.
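To make the optimization-free recipe concrete, here is a minimal, hedged sketch (not the paper's code) of how a frozen image encoder's feature can be projected into a single pseudo-word embedding that conditions a frozen T2I model; the encoder choice, dimensions, and the placeholder token name "S*" are illustrative assumptions.

```python
# Minimal sketch of the generic optimization-free recipe: a frozen image encoder
# produces a feature, a small projector maps it into a pseudo-word embedding, and
# that embedding is spliced into the prompt of a frozen T2I model.
import torch
import torch.nn as nn

class WordEmbeddingProjector(nn.Module):
    def __init__(self, image_feat_dim: int = 768, text_embed_dim: int = 768):
        super().__init__()
        # small MLP that turns an image feature into one pseudo-word embedding
        self.proj = nn.Sequential(
            nn.Linear(image_feat_dim, text_embed_dim),
            nn.GELU(),
            nn.Linear(text_embed_dim, text_embed_dim),
        )

    def forward(self, image_feature: torch.Tensor) -> torch.Tensor:
        # image_feature: (batch, image_feat_dim) -> (batch, text_embed_dim)
        return self.proj(image_feature)

# usage: replace the placeholder token's embedding (e.g., "S*") before
# running the frozen diffusion model on a prompt such as "a photo of S* hiking"
projector = WordEmbeddingProjector()
face_feature = torch.randn(1, 768)        # stand-in for a frozen encoder's output
pseudo_word = projector(face_feature)     # embedding assigned to the "S*" token
```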
In short, all concurrent optimization-free efforts struggle to preserve identity while keeping the model editable. The authors argue that two issues are the root causes of this difficulty in existing optimization-free studies: (1) an inaccurate identity feature representation and (2) an inconsistent objective between training and testing. On the one hand, the best CLIP model currently still performs noticeably worse than a face recognition model on top-1 face identification accuracy (80.95% vs. 87.61%), indicating that the common encoder (i.e., CLIP) used by concurrent work is inadequate for the identity re-contextualization task. Moreover, CLIP's final-layer feature, which mainly captures high-level semantics rather than precise facial details, fails to retain the identity information.
On the other hand, editability with respect to the input face suffers because all concurrent work learns the word embedding with a vanilla reconstruction objective. To address these difficulties of identity preservation and editability, the authors propose a novel optimization-free framework (named DreamIdentity) with an accurate identity representation and a consistent training/inference objective. More precisely, for accurate identity representation they design a Multi-word Multi-scale ID encoder (M² ID encoder) built on the Vision Transformer architecture; this encoder is pre-trained on a large-scale face dataset and projects multi-scale features into multi-word embeddings.
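As a rough illustration of the M² ID encoder idea (a hedged sketch, not the released implementation), the snippet below projects features taken from several ViT layers into several word embeddings; the number of scales, the dimensions, and the projector shape are assumptions for illustration.

```python
# Multi-scale -> multi-word sketch: features from several intermediate ViT layers
# are each projected into their own word embedding, yielding multiple pseudo-words
# per face instead of a single token.
import torch
import torch.nn as nn

class MultiScaleMultiWordHead(nn.Module):
    def __init__(self, feat_dim: int = 768, text_embed_dim: int = 768,
                 num_scales: int = 3):
        super().__init__()
        # one projector per selected ViT layer -> one word embedding per scale
        self.projectors = nn.ModuleList(
            nn.Linear(feat_dim, text_embed_dim) for _ in range(num_scales)
        )

    def forward(self, multi_scale_feats: list) -> torch.Tensor:
        # multi_scale_feats: list of (batch, feat_dim) features from different layers
        words = [proj(f) for proj, f in zip(self.projectors, multi_scale_feats)]
        return torch.stack(words, dim=1)  # (batch, num_scales, text_embed_dim)

head = MultiScaleMultiWordHead()
feats = [torch.randn(1, 768) for _ in range(3)]  # stand-ins for ViT layer outputs
multi_word_embeddings = head(feats)              # shape (1, 3, 768)
```

Stacking the per-scale projections gives multiple pseudo-words that can stand in for a single identity token in the prompt, which is the mechanism that lets the text encoder carry richer identity detail than one CLIP feature would.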
Researchers from the University of Science and Technology of China and ByteDance further propose a Self-Augmented Editability Learning scheme that moves the editing task into the training phase. It uses the T2I model itself to build a self-augmented dataset by generating celebrity faces together with various target-edited images of those celebrities, and the M² ID encoder is trained on this dataset to improve the model's editability. The contributions of this work are as follows. They argue that existing optimization-free approaches fall short on ID preservation and high editability because of their inaccurate identity representation and inconsistent training/inference objectives.
Technically, (1) they propose the M² ID encoder, which projects ID-aware multi-scale features into multiple embeddings, for accurate identity representation; and (2) they introduce self-augmented editability learning, in which the underlying T2I model produces a high-quality editing dataset, to achieve a consistent training/inference objective (a sketch of this data-construction step appears below). Comprehensive experiments demonstrate the effectiveness of these methods, which preserve identity while allowing flexible text-guided modification, i.e., identity re-contextualization.
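Below is a hedged pseudocode sketch of how the self-augmented dataset and the editing-aware objective described above could be wired together; the `t2i_model.generate` and `diffusion_loss` calls, the prompts, and the function names are illustrative assumptions, not the authors' actual API.

```python
# Self-augmented editability learning, sketched: the frozen T2I model synthesizes a
# source portrait of a (celebrity) identity and a target image of the same identity
# under an editing prompt; the ID encoder is then trained so that its pseudo-words
# plus the editing prompt reconstruct the *edited* target, matching test-time use.
def build_self_augmented_pair(t2i_model, identity_name: str, edit_prompt: str):
    # source image: a plain portrait of the identity
    source_face = t2i_model.generate(f"a photo of {identity_name}")
    # target image: the same identity re-contextualized by the editing prompt
    edited_target = t2i_model.generate(f"{edit_prompt}, {identity_name}")
    return source_face, edited_target

def training_step(encoder, t2i_model, source_face, edit_prompt, edited_target):
    # the encoder maps the source face into multi-word ID embeddings ...
    id_words = encoder(source_face)
    # ... and the diffusion loss is computed against the edited target, so the
    # training objective is consistent with the editing expected at inference
    return t2i_model.diffusion_loss(prompt=edit_prompt,
                                    id_embeddings=id_words,
                                    target_image=edited_target)
```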
Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.