Okay so I think we should start.
It's my great pleasure to have Chuang Gan here.
He's interviewing for a position in my group, and he comes with
a rich research background; his adviser is Andrew Yao.
I don't know how you can pick up a research topic which is
totally outside your adviser's domain of expertise, but
Chuang has really rich experience.
He has been a visiting research student at Stanford while
enrolled at Tsinghua, and he also did several internships at
both Microsoft Research Asia and Google Research.
So, Chuang, the floor is yours now.
Looking forward to it.
>> Okay, thank you, Gan, for the introduction.
Today it is a great honor to be here to present my PhD work
on video understanding, from tags to language.
We are currently in the era of big multimedia data.
In a single minute,
about 300 hours of video are uploaded to YouTube.
So a natural question arises:
how can we organize this large amount of consumer video?
Okay, let's first take a look at how YouTube approaches this.
YouTube conducts video search based on keyword matching
between user queries and video metadata such as the title,
description, or comments.
This is indeed a very effective and
efficient way to conduct video search.
But there are also some problems with video metadata.
The first one is that the metadata can be noisy or
irrelevant.
When users upload a video to the website,
we cannot expect them to give a perfect title that exactly
matches other users' search behavior.
And secondly, we also observe that a lot of video content
has no metadata at all.
In both of these cases, keyword-based search will fail.
So there is a pressing need to conduct content-based video
analysis.
My research focuses on understanding human actions and
high-level events in web and consumer videos.
In particular, in this talk I will first introduce my efforts
on designing effective network architectures to learn robust
video representations.
My paper published in CVPR 2015 is among the pioneering works
that successfully applied deep learning to the video event
recognition task.
We also provided the best model in the TRECVID multimedia event
detection competition, and also, in this year's
ActivityNet challenge,
we provided the single best model in the YouTube-8M
challenge.
Even though supervised deep learning can achieve reasonably good
performance for video recognition, we cannot
expect this kind of fully supervised learning to scale up.
So in the second part I will introduce our efforts on
learning video recognition from weak supervision.
First, I will show how web-crawled images and videos
can replace human-annotated examples for
video recognition.
Secondly, I will show how we propose a new method to
conduct zero-shot video recognition by
connecting it with a knowledge base.
At the end of my talk, I will introduce our most recent efforts
that connect video understanding with language,
through attractive visual captioning and
visual question segmentation.
Okay, let's start with the video recognition part.
Currently, there are three challenges for
video recognition.
The first is that video data has large inter-class variations
and is also inherently very complex,
which makes video recognition a very challenging task.
Secondly, it is very labor-intensive and
time-consuming to annotate a video.
This is the reason why current video recognition
datasets typically contain only hundreds of classes,
much smaller than ImageNet.
Thirdly, we also realize there are a huge number of video
concepts, because video concepts always consist of actions,
scenes, and objects.
So the number of combinations grows combinatorially, which makes
the video concept space very large.
To address the first challenge, we propose
learning video representations using deep neural networks.
Okay, so
the problem we want to solve is that, given a test video,
we not only want to give it an event label, such as
"attempting a bike trick" for this video,
but we also want to
localize the spatio-temporal key evidence.
For example, the bike needs
to be contained in the localized region,
and it should be the bike itself
instead of the background window, because the bike
is what tells us this is attempting a bike trick.
To achieve this goal, we propose a trainable deep event
network, which we call DevNet.
We first pre-train the network using image data and
then fine-tune it on the video data.
However, we differ from the existing approaches that treat
video classification as an image classification
problem, by which I mean that they take each key frame,
feed it forward through the network,
and finally aggregate the scores over all the key frames
to obtain a final video-level prediction.
We see that video has a natural temporal dimension,
and we should exploit the temporal information contained in a
video to achieve the video classification.
So the first technical contribution we propose is to
extend the input from a single key frame to multiple key
frames, which we call a video segment.
Then we feed the video segment as the input to the network.
The second contribution is that we propose a learnable
feature aggregation
layer, which we name cross-frame max pooling.
It learns how to aggregate the representations from
multiple key frames into a single vector, which is
then used for classification and key-evidence localization.
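To make that concrete, here is a minimal sketch, in PyTorch, of a segment-level classifier with a cross-frame max-pooling layer; this is my own illustration rather than the paper's code, and the backbone, feature dimension, and frame count are assumptions chosen for clarity.

import torch
import torch.nn as nn
import torchvision.models as models


class CrossFrameMaxPool(nn.Module):
    """Aggregate per-frame features with an element-wise max over frames."""

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim)
        pooled, _ = frame_feats.max(dim=1)          # (batch, feat_dim)
        return pooled


class SegmentClassifier(nn.Module):
    """Image-pretrained backbone applied per frame, pooled across the segment."""

    def __init__(self, num_classes, num_frames=5):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained on images
        self.feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                          # keep pooled features
        self.backbone = backbone
        self.pool = CrossFrameMaxPool()
        self.classifier = nn.Linear(self.feat_dim, num_classes)
        self.num_frames = num_frames

    def forward(self, clip):
        # clip: (batch, num_frames, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))            # (b*t, feat_dim)
        feats = feats.view(b, t, self.feat_dim)
        return self.classifier(self.pool(feats))             # (b, num_classes)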
Equipped with these two technical contributions,
we achieved state of the art on
the TRECVID multimedia event detection dataset.
I want to mention that we boosted the previous state of the art
by roughly 40 percent.
Traditional approaches based on hand-crafted features had
dominated this task for several years,
and our paper is among the first to show that a
deep neural network can beat that kind of representation.
Given that, we also wanted to
understand why the deep neural network can achieve such
good results.
So, we visualize the spatio-temporal saliency maps to
understand what the deep neural network
actually learned to localize during training.
We find that it captures
some very interesting
information.
For example, for attempting a bike trick,
the region with the person riding the bike has a high
response, and for the dog show,
the dog region also has a high saliency score.
Another very interesting observation is about the fourth
event, playing fetch.
Actually, from my understanding,
I did not know whether the person or
the dog is more important for distinguishing playing fetch.
A standard saliency detection approach, shown on
the rightmost, tells me that the person is more salient,
but when I look at the learned key-frame evidence,
it tells me that the dog is the key evidence
for discriminating this event.
So this somehow indicates that the machine
understands the visual evidence in its own way,
which does not always match our human intuition.
This paper, published two years ago, has already
received quite nice attention; for example,
I want to mention two follow-up works built upon our network.
The first work is the winner of last year's ActivityNet
challenge, the Temporal Segment Networks, and
then, sorry, yes?
>> [INAUDIBLE] >> Sure.
>> For the last one, do you somehow extract
shots from the video
before you feed it in?
Yes, to answer that question: we first decompose
the video into different segments, and
we only extract the middle key frame of each segment as the
input to our network, to save computing time.
>> How do you decompose it?
>> Yeah,
actually it is based on the color change.
That means we can detect the shot boundaries, and
then we pick the middle frame of each shot.
>> So if the decomposition is very bad,
does that affect the final result through the key frames?
>> Yeah.
There are many papers that talk about how we can
better segment the video.
From our observation,
a bad segmentation definitely hurts the result.
>> Are you
using motion information or static cues?
>> In this paper we do not use motion information, and
in the next slides I will introduce something about how we
deal with motion information.
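Since the segmentation question came up, here is a minimal sketch of one common recipe for color-change-based shot segmentation with middle-key-frame selection; it is an assumption about the general approach, not the speaker's exact implementation, and the threshold value is illustrative.

import cv2


def middle_keyframes(video_path, threshold=0.4):
    """Split a video at large color-histogram changes; keep each segment's middle frame."""
    cap = cv2.VideoCapture(video_path)
    frames, hists = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hists.append(cv2.normalize(hist, hist).flatten())
    cap.release()

    # Declare a boundary wherever consecutive histograms differ strongly.
    boundaries = [0]
    for i in range(1, len(hists)):
        if cv2.compareHist(hists[i - 1], hists[i],
                           cv2.HISTCMP_BHATTACHARYYA) > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))

    return [frames[(s + e) // 2]
            for s, e in zip(boundaries[:-1], boundaries[1:]) if e > s]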
So let me continue my talk first.
Our work has been followed up, and here I will mention
two works.
The first is the Temporal Segment Networks work, which won the
ActivityNet challenge at the beginning of last year.
Reading their technical report, the authors acknowledge
my first technical contribution: using video segments
instead of single key frames improves over the traditional
network, and they go one step further and
also apply the same idea to the optical flow input,
which relates to your question about modeling motion
information.
The second one I want to mention is the winner of the 2015
multimedia event detection challenge.
Actually, they adopt the same idea as my second technical
contribution: instead of directly averaging the frame-level
predictions,
they also learn the feature aggregation,
but instead of using our max pooling method, they use
a more advanced feature encoding to do the fusion.
Our work has also received attention in video
localization and in network interpretability.
And I think you will hear more interesting recent
progress: this year we took part in the ActivityNet challenge.
I'm not sure how many people know ActivityNet;
it is something like the ImageNet of the video domain.
We took part in the competition this year and
won it.
And here I want to share with you some of the technical reasons
why we could win
these challenges.
First of all, I want to show you the proposed deeper recurrent
network.
Our original observation was that a deep LSTM for
video recognition did not seem useful, but
we drew that conclusion from small datasets
such as UCF101 and
HMDB, which contain only about 10,000 videos.
We speculated that the failure of the deeper RNN
may come from two reasons.
The first is that the dataset may not be big enough, so
we do not need that big a model to fit the data.
The second is that maybe we had not carefully tuned the network
architecture to make it deep.
So, to figure out whether a deeper RNN helps for
video recognition, we first attacked the YouTube-8M
dataset, since the emergence of YouTube-8M
and other large-scale video data such as ActivityNet
means we now have a statistically much bigger set of videos,
which gives us the opportunity to verify our conjecture.
What we found is that when we naively increase the depth of
the RNN, we still get bad performance.
We found the reason is essentially a training issue:
when we simply stack more recurrent units,
the network becomes much harder to optimize,
so naively going deeper
is not worthwhile.
So here we propose to add
residual, fast-forward connections between the recurrent units.
With this change,
we found that the model trains faster, converges better, and
reaches better performance than the shallow network.
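As a concrete illustration, here is a minimal sketch of stacked LSTM layers with identity skip connections between consecutive recurrent layers; it reflects my reading of the idea described above, not the actual competition code, and the layer count and dimensions are placeholders.

import torch
import torch.nn as nn


class ResidualStackedLSTM(nn.Module):
    """Stack LSTM layers with skip connections so deeper RNNs stay trainable."""

    def __init__(self, input_dim, hidden_dim, num_layers=7):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)   # match dims for the skip
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, x):
        # x: (batch, time, input_dim), e.g. per-frame CNN features
        h = self.proj(x)
        for lstm in self.layers:
            out, _ = lstm(h)
            h = h + out                                # residual connection
        return h.mean(dim=1)                           # clip-level representation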
The second thing I want to show is a purely attention-based
network. We were inspired by
the recent success from Google Brain showing that
pure attention works well for machine translation.
We tried a very similar idea
along the temporal dimension, and we observed that
using only temporal attention, without any RNN,
also gives really good performance.
A further difference from simply applying attention in
the temporal layer is that we introduce some
additional learnable parameters here, around sixty-four of them,
and these parameters are learned jointly with the network and
then combined for a better prediction.
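Here is a minimal sketch of one way to realize this kind of temporal attention pooling with a bank of learned query vectors; it is an assumption about the general recipe rather than the exact challenge model, and the query count of 64 simply mirrors the number mentioned above.

import torch
import torch.nn as nn


class TemporalAttentionPool(nn.Module):
    """Pool frame features with a set of learned attention queries (no RNN)."""

    def __init__(self, feat_dim, num_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        self.out = nn.Linear(num_queries * feat_dim, feat_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim)
        scores = torch.einsum("qd,btd->bqt", self.queries, frame_feats)
        attn = scores.softmax(dim=-1)                    # attention over time
        pooled = torch.einsum("bqt,btd->bqd", attn, frame_feats)
        return self.out(pooled.flatten(1))               # (batch, feat_dim)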
The third model we propose uses
temporal CNNs to replace the
recurrent units.
The idea is inspired by
the recent work that proposes convolutional
sequence-to-sequence models to replace recurrent networks.
We have a similar finding in the video domain: using temporal
convolutions to go through the frame sequence
works really well compared to the recurrent framework.
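For illustration, here is a minimal sketch of a small 1-D temporal convolution stack over per-frame features used as a drop-in replacement for a recurrent encoder; the depth, width, and kernel size are assumptions, not the challenge configuration.

import torch
import torch.nn as nn


class TemporalConvEncoder(nn.Module):
    """Encode a frame-feature sequence with stacked 1-D temporal convolutions."""

    def __init__(self, feat_dim, hidden_dim=512, num_layers=3, kernel_size=3):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_dim, hidden_dim, kernel_size,
                                 padding=kernel_size // 2),
                       nn.ReLU(inplace=True)]
            in_dim = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim); Conv1d wants (batch, channels, time)
        h = self.net(frame_feats.transpose(1, 2))
        return h.mean(dim=2)                             # clip-level vector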
The last thing I want to show is that we also revisited the
frame sampling strategy when we train the network for
video recognition.
We found that for video recognition
we do not need to use all the video segments;
we only need a small subset.
So we propose that during training
we only sample a subset of the key frames,
and we add a diversity constraint so that
the sampled segments spread over the whole video.
We found that this makes training faster and
also gives better performance.
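A minimal sketch of that kind of diverse, segment-wise frame sampling is shown below; it is my own illustration of the spread-out constraint, with an assumed segment count, rather than the exact procedure used in the challenge system.

import random


def sample_diverse_frames(num_frames, num_segments=8):
    """Return one randomly chosen frame index per temporal segment of the video."""
    edges = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    indices = []
    for start, end in zip(edges[:-1], edges[1:]):
        lo = min(start, num_frames - 1)
        hi = max(min(end, num_frames), lo + 1)   # keep the range non-empty
        indices.append(random.randrange(lo, hi))
    return indices


# Example: pick 8 temporally spread frames from a 300-frame video.
print(sample_diverse_frames(num_frames=300, num_segments=8))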
So far I have introduced how we design networks for
video recognition when we have labeled training data.
In the second part, I will introduce how we can learn
video recognition from web data. The idea is that if I put a
query into the Google image search engine
or the YouTube search engine,
the returned images and videos are usually quite relevant
to the search query.
So it would be great if we could use these web-crawled images
and videos to
replace human-annotated examples for video recognition;
with such a tool, it would be much easier to scale up video
recognition.
However, even though this is a really attractive idea,
the web videos have the problem of being untrimmed.
You can see the first example,
mopping the floor:
most of the video frames are not about mopping the floor; only
a small part in the middle actually belongs to mopping the
floor. So for untrimmed web videos, only a small part of the
frames is relevant to the action of interest.
Web images, on the other hand, have the benefit that they
usually capture the highlight of the action, but
they may suffer from a domain gap with respect to YouTube
videos.
You can observe the second example, juggling balls:
the web image of juggling balls looks quite different from the
video frames, so we cannot directly use it in the video domain.
And for the third row, baby crawling, we find that the web
image does show a baby crawling,
but against a very different background.
[INAUDIBLE]
So we cannot simply transfer
such clean images to the [INAUDIBLE] video domain.
So it seems very difficult to remove the noise,
because with this kind of webly-supervised [INAUDIBLE] learning,
we do not have any clean guidance here.
But we have a key observation here: we found that even
though we cannot separately remove the noise from the web
images and the web videos, jointly we have a chance.
The interesting thing we found is that the relevant
web images and the relevant video frames are typically
visually similar,
while the irrelevant parts have their own distinct appearance.
For example, the noise from the web images,
such as images with logos,
does not usually occur among the video frames.
And the noisy video frames are mostly background frames,
which do not occur among the web images.
So it seems that we can
do a mutual filtering, in which the visually similar,
relevant parts from the two sources
vote for each other.
Based on this observation, we propose the Lead-Exceed Network.
In the first step, we take the video key frames
as the input to train the lead network.
This step gives the network some ability to
distinguish the relevant video frames [INAUDIBLE].
More importantly, it can also be used to distinguish the
relevant web images.
Because, as I said, the noisy parts of the web videos and
of the web images are very different,
while the relevant, highlighted parts that match the query
are visually similar, we
can apply the lead network to the web images and,
based on the prediction scores, set a threshold and
keep only the highly related web images.
And since these web images always contain the highlight of
the action, they can be fed into the network
to further fine-tune it, which gives the exceed network.
The exceed network then has a better ability to localize
the action,
and it can be applied back to trim the video frames.
So after this iteration, both the noisy web images and
the noisy video frames are removed, and we can feed the trimmed
videos into a temporal model
to model the temporal information.
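To make the filtering step concrete, here is a minimal sketch in which a model first fine-tuned on video key frames scores the crawled web images, and only confident, query-consistent images are kept for the next round of fine-tuning; the function names and the threshold are hypothetical, and this is my reading of the description above rather than the released code.

import torch


def filter_web_images(lead_model, web_images, web_labels, threshold=0.7):
    """Keep only web images the key-frame-trained model scores as relevant."""
    lead_model.eval()
    kept_images, kept_labels = [], []
    with torch.no_grad():
        for image, label in zip(web_images, web_labels):
            probs = lead_model(image.unsqueeze(0)).softmax(dim=-1)[0]
            if probs[label].item() >= threshold:  # confident, likely a relevant highlight
                kept_images.append(image)
                kept_labels.append(label)
    return kept_images, kept_labels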
Here we show some comparison results.
When we first train only on the filtered web images,
we get around 50 percent accuracy.
If we then also train on the filtered web video frames,
we improve by over 10 percent,
which means the web video data is very useful for recognition.
We also compare with other strategies, for example
training two separate models and fusing them,
or using a single model on only one source.
Another baseline we tried is simply mixing the web images and
video frames together in one network,
and its performance is still worse than our strategy of first
training on the video frames
and then fine-tuning with the filtered web data [INAUDIBLE].
>> Do you see [CROSSTALK] >> Yes.
>> But you know videos typically will have [INAUDIBLE]
which [INAUDIBLE] don't have- >> Mm-hm.
>> So, did you try changing, let's say, the early filters,
which are most specifically [INAUDIBLE], rather than mixing
them up?
>> Yeah, this is why I separate the two kinds of noise,
the noise in the images and the noise in the videos.
[INAUDIBLE]
The main thing is that we also apply our filtering step before
we mix the images and frames in the neural network,
and you can see how that performs.
>> And then, the problem that I mentioned: these web-crawled
images and videos have a lot of label noise, right?
>> Yeah. >> So,
it's not clear to me how you're addressing that problem
of noisy labels.
>> Okay, this is what I described with the Lead-Exceed
network; the general intuition is this.
The noise in the web videos and in the web images is
different: the irrelevant parts of the two sources look
different, but
the relevant parts are visually similar.
So the idea is that when the network sees a similar, relevant
example, it will give it a higher detection score, right?
>> Okay.
>> And the relevant web images
have a very similar appearance to the relevant video frames,
so they will be ranked higher, while the dissimilar, noisy
parts will be ranked lower.
So then, we can set a threshold on the score
to remove the noisy images
and keep the relevant ones.
And the selected images have the benefit
that they always show the highlight of the action,
so we can put them into the network for further
fine-tuning.
Since the network has then seen more highlight examples,
it is better at distinguishing the highlights,
and we can apply it back to the videos
to rank the frames and pick out the video highlights.
>> Kwan? >> Mm-hm.
>> So when you test on the web-crawled data to find related
data, do you use some attention, or do you just score the
whole set?
>> We are not using attention here;
we just use the classification scores for the filtering.
And we definitely find some failure cases, so
I think attention is probably a direction worth trying.
Yeah, I think [INAUDIBLE] >> Also, this is a kind of
positive data mining.
Did you try to repeat this loop: go back,
collect data, and retrain?
Did you find any benefit in repeating this process more than
once, of mining positive data?
>> Yeah, that's a good question, because in my paper I only do
one iteration; with multiple iterations [INAUDIBLE].
Yes, actually this question was
also asked by a reviewer.
We definitely found that doing multiple iterations
helps, maybe 0.2 points of improvement, but the gain,
I think, also saturates [INAUDIBLE].
>> Sure. Also, there could be a dataset bias, right,
between the images and videos which you are using for
doing the training?
So, have you tried training on one kind of dataset and testing
on a completely different dataset to see the generalization?
>> [INAUDIBLE] Yeah,
this is actually what these results [INAUDIBLE] show.
We also tested on [INAUDIBLE],
where we crawled all the test videos from a
different source, because it is more like [INAUDIBLE] a query
setting.
That means the testing videos come
from a different domain.
We also compared the performance with other methods, and
we found that it definitely much improves the results to use
the Lead-Exceed network.
I think for the sake of time I should move to the next part;
we can talk about it, but let's talk offline, yeah?
Okay. Even though the webly-supervised approach can achieve
reasonably good
performance on video recognition, we still cannot
train classifiers on web data for every possible
video concept beforehand.
So next I want to address the problem of
zero-shot video retrieval:
how can we recognize new video concepts without any
positive training data?
How can that even be feasible?
The problem setting is that you are given a textual query
and also a pool of videos, and
maybe the system returns a ranked list of, say, 20 videos
[INAUDIBLE],
where each returned item indicates a video relevant to
the query.
The traditional
approach is to embed the video and
the textual query into a common space
[INAUDIBLE] and do the retrieval there.
I agree this is a very [INAUDIBLE] reasonable retrieval-based
approach,
[INAUDIBLE] but you cannot leverage
the relationships between categories,
I mean between video content categories.
To make this concrete, say we want to recognize a new
video content category, soccer penalty,
even though we do not have any soccer penalty training
data. If a video is very similar to both field hockey penalty
and soccer, it is very likely to be a soccer penalty.
In other words, if a video scores high on both the field
hockey penalty classifier and the soccer classifier,
it is very likely to show a soccer penalty.
So the problem then becomes: given a new video content
category, how can we identify the related concepts from a
concept pool?
To address this, we use a data-driven approach
that measures the semantic similarity between two concept names.
Let me give a concrete example.
Given a new category name,
we measure its similarity to
each of the pre-trained concept names, and then we obtain a
ranking score over all of the video concept names.
For this example, we will find that the highly relevant
concepts are [INAUDIBLE].
Then we combine the responses of these related concept
classifiers with a ranking function, so that videos
that are similar to the related concepts are ranked higher.
By summing over these related concept scores,
we can recognize the new concept.
multimedia event and the only secret I repeated with
is that we collaborating the relationship with instead,
the other team always trains hard to improve the, so
we're doing So we just quite different.
Credit a different domain.
Okay, next I want to introduce something more
interesting, about connecting vision with language.
First of all, I want to introduce our work on stylized visual
captioning;
it is about generating attractive captions.
Okay. The background of this problem is that [INAUDIBLE]
a lot of people are
doing image captioning, and many people will argue:
what is the real application of image or video captioning?
We also find that the current captioning systems always
generate a flat, factual description of the visual content.
[INAUDIBLE] Instead of just stating the facts,
[INAUDIBLE] we can say something more interesting.
For example,
a plain caption like
"a pretty girl" does not really attract me.
So a stylized caption [INAUDIBLE] is more attractive, and
we also find [INAUDIBLE] another benefit:
you can see from these images that
an image becomes popular
not only because of the image content,
but also because it has a perfect title.
Also, another motivation [INAUDIBLE] for
this work is that when I upload images
to social media such as Facebook or Twitter,
it always takes me a long time to figure out or try to
type a title that will attract more people to like my post.
So if a machine can do this automatically, it would be very,
very useful.
We also find that using [INAUDIBLE] is
a kind of method [INAUDIBLE].
So this is the motivation for generating
stylized, attractive
captions.
The methodology of our approach is that we propose
a factored LSTM, and the idea is actually very simple.
We decompose the weight matrix of
the LSTM into three matrices, U, S, and V, where the
product U S V plays the role of the original weight matrix.
What we want is that, during caption
generation, we can separate the content part and
the style part.
We hope that U and V capture the content,
controlling what is described in the image,
while S controls the style.
To achieve this goal, we have paired images with factual
captions,
and we also have monolingual romantic
sentences and humorous sentences that are not paired with
images. Since it is very hard to gather paired images
with romantic or humorous captions,
we want to achieve the goal with only this unpaired style text
[INAUDIBLE].
So [INAUDIBLE] we train the model with multi-task learning.
The first task is image captioning: given the image,
we generate the factual caption, including
the [INAUDIBLE].
The second task is using the romantic sentences
to do language modeling:
given the first word, predict the second
word, and so on.
And the third task is the same language modeling
on the humorous sentences.
During the multi-task training,
we share the weights of U and V among the different tasks,
and S is style-specific.
In this way, U and V learn to control the content
[INAUDIBLE] for the image caption generation, and
S learns to control the style.
During the caption generation, we can
then simply switch the style matrix S to generate
whatever style we want.
So if we want to generate the factual caption we can use SF,
if we want to generate a romantic sentence we can use SR,
and if we want to generate a humorous sentence we can use SH.
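Here is a minimal sketch of that factored-weight idea: an LSTM cell whose input-to-hidden weights are the product U S V, with U and V shared and a per-style S that can be swapped at generation time. It is my reconstruction of the description above, not the paper's code, and the factor dimension and initialization are placeholders.

import torch
import torch.nn as nn


class FactoredLSTMCell(nn.Module):
    """LSTM cell with input weights factored as U @ S @ V; S is style-specific."""

    def __init__(self, input_dim, hidden_dim, factor_dim,
                 styles=("factual", "romantic", "humorous")):
        super().__init__()
        self.U = nn.Parameter(torch.randn(4 * hidden_dim, factor_dim) * 0.02)
        self.V = nn.Parameter(torch.randn(factor_dim, input_dim) * 0.02)
        self.S = nn.ParameterDict(
            {s: nn.Parameter(torch.randn(factor_dim, factor_dim) * 0.02)
             for s in styles}
        )
        self.W_h = nn.Linear(hidden_dim, 4 * hidden_dim)  # hidden-to-hidden, shared

    def forward(self, x, state, style):
        # x: (batch, input_dim) word embedding; style: "factual" | "romantic" | "humorous"
        h, c = state
        W_x = self.U @ self.S[style] @ self.V             # style-conditioned input weights
        gates = x @ W_x.t() + self.W_h(h)
        i, f, g, o = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

In this sketch, the captioning and language-modeling tasks would update the shared U, V and their own S during training, and swapping the style key at generation time plays the role of choosing SF, SR, or SH as described above.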
Let's see some concrete examples.
This is a caption generated by the caption bot:
"a dog laying in the grass."
Our romantic sentence would be "a dog
running through the grass to meet his lover,"
and the humorous one would be "a dog running through the grass
in search of the missing bones."
For another example in the demo, a factual
sentence like "a man on a rocky hill next to a stone wall"
seems really boring.
It becomes romantic when you say "a man is rock climbing
to conquer the height,"
and humorous with "a man is climbing the rock like a lizard."
We also show that our method can be applied to the video
domain. For this video, the factual caption may be "a man is
playing a guitar."
The romantic one would be "a man practices the guitar,
dreaming of being a rock star,"
and the humorous caption would be "a man playing a guitar,
but he runs away."
So the stylized caption is more attractive, and
it can also make the image or video more popular.
>> But the only way you can generate this is from
the training data.
You have some corpus of this, right?
>> Yeah, that's a really good question.
>> So you're not being creative; the model is not being
creative, it's copying things, right?
>> Yeah, so, of course, I want to say,
since the model is trained on data,
it definitely captures some patterns from the [INAUDIBLE].
But I don't agree that it simply
copies the sentences, because we definitely find that
the model is learning the language patterns from the data.
I mean, it learns how people usually use [INAUDIBLE] to
make a description romantic or humorous,
rather than memorizing whole sentences.
>> Wouldn't it be interesting to check what is generated?
>> [INAUDIBLE] You should go back to your training data and
compute some semantic distance between your generated captions
and the training captions.
What were the closest ones in the training data?
That would help to understand what the model
is doing here.
>> Yeah, that's a good point.
First I want to say that because we do not have paired
images with romantic sentences,
you cannot use a standard supervised captioning approach
to achieve this goal, because you don't have the paired images.
And we also definitely find that
some language patterns are [INAUDIBLE], and
if you generate many captions, some patterns will be repeated.
What we [INAUDIBLE] want is to distill how people
[INAUDIBLE] make their descriptions romantic or humorous,
so we also want the machine to learn to be humorous
[INAUDIBLE].
>> Yeah, are these two existing datasets,
or are they separate datasets that you have labelled,
from some book or- >> Right, [INAUDIBLE].
Yeah, actually we collected both; it is the first
dataset of this kind.
There are also people who have collected some
humorous sentences [INAUDIBLE].
And the interesting thing is that we labelled a paired test
set, because we want to do both quantitative and
qualitative evaluation.
So we also do some evaluation with human raters.
>> So just a practical comment here:
it's very difficult to be funny on demand.
It's much easier to be, say, romantic;
your romantic examples are often funnier.
>> Yeah, I agree, it's more like that.
>> So you shouldn't really be doing the humorous data
set, you maybe shouldn't even do the romantic data set;
you should do something like Donald Trump.
You should do styles that are much more personal
and that stand out.
>> Yeah, actually we do not want to differentiate between
different styles [INAUDIBLE], because, as you can see, our goal
is about [INAUDIBLE].
What we want to do is see if we can generate
more interesting captions.
So during the evaluation we also do a very interesting
[INAUDIBLE] user study.
We show the human raters the captions
generated for the same image,
both from other systems and from our system,
and we ask them one question:
if you wanted to upload this image to the website,
which caption would you prefer?
So this can
reflect- >> You have a few
options for [INAUDIBLE].
>> Yeah, yeah.
>> And one thing we've done which is related to this is,
just for question answering systems,
to make the answer sound like it comes from a certain style.
>> Yeah.
>> And one thing you can do is movie scripts, like we did with
the Star Wars scripts. >> Yeah.
>> And it's pretty hilarious what comes out;
you can give all these possibilities.
It'll be kind of a cool tool for social media.
>> Exactly, yeah, I agree.
>> Yeah, I want to readdress
this question about whether the sentences are copied.
I think the generation of the English caption is mostly
based on the [INAUDIBLE], exactly what you are suggesting, but
with an LSTM, you have the chance to combine fragments of
sentences: one part from one sentence and
another part from another sentence.
>> Right, yeah.
>> But my question here is that this kind of style, I
feel, is much better handled [INAUDIBLE] if you position this
problem in conversation and in dialogue.
Because, really, you need to take other factors into
consideration, like what the question is,
and what the other person's personality is.
>> Yeah, maybe some conditions or
something [INAUDIBLE] >> Yeah, there could be some
other factors; even a social network is
actually loosely connected, like a dialogue, right?
Like if you are making comments, some people respond
maybe two days later, not immediately.
>> [LAUGH] >> But in a chat bot scenario,
you need to take those contexts into account immediately.
So is this from work
in collaboration with the Shell company?
>> Yeah, this was from my last industry internship project.
>> It's all right, [INAUDIBLE].
>> Okay.
>> My other question is about your U-S-V decomposition.
>> Yes.
>> It's- >> Yeah.
>> It's mathematically ambiguous, right?
I can multiply by an affine transformation [INAUDIBLE], so
how do you regularize it?
>> Yeah, you have to [INAUDIBLE] and
also [INAUDIBLE] multiple times when training the network.
And for other people, okay, let me come back to the slides.
I think we have enough time, yeah,
so maybe I can come back to the StyleNet part first.
Many people may ask why we [INAUDIBLE] came up
with three matrices instead of two,
with U and V controlling the content and S the style.
We definitely tried the two-factor version, and
we find the performance of the three-factor version
is better on the [INAUDIBLE] metric.
This is because [INAUDIBLE],
and it gives more robust results.
And the idea of controlling the style using
a separate factor is, first of all, very, very simple.
Also, it is widely used: for style and
content control there are, for example,
methods using matrix or tensor factorization that
can separate the content and the motion or style part.
We borrowed that idea here, and
we made it work [INAUDIBLE] for our setting.
Okay, the next thing I have to- >> I have one question.
>> Sure. >> So for S, you conjecture that
S is encoding style in this factorization, and
you can test that, right?
So you can put in an input and
then actually change the values of S by hand and
see whether you get a different- >> Yeah, that's really, yeah,
that's quite true.
Actually, this is [INAUDIBLE],
for example mixing the style factors, say 0.5 of each,
to make things partly humorous.
I want to say that in this kind of framework, because there is
no explicit control of the degree of style, that kind of manual
mixing is not guaranteed to work.
But we did try multiplying and mixing the style factors, and
we definitely got some interesting mixed results, and
also some really cool outputs, which is why I
mention it here.
>> For that part, it seems to me that being able to adapt it
to a particular person would be more interesting, making
some adjustments.
>> Yeah.
I agree, yeah.
It would be interesting to do so.
It could also reflect how much style control we want to apply,
I mean, how much we want to
adjust the degree of style, yeah.
Okay, so that's fine.
I also want to show another example, from work done together
with Xiahou Dun, because
last summer we also worked on a caption generation system.
We want to move toward a deeper understanding of
the image content.
For example, we can start
with an entity recognition approach.
For example, with a really good
entity recognition system, it could recognize that the person
in the image is Obama,
and we could say that the person standing there is Obama.
But we can go one step further:
we can combine knowledge bases to do some reasoning.
Obama is a person of the Democratic Party, and
the competitor of the Democratic Party is the Republican Party.
And the mascot of the Republican Party is the elephant,
so we can come up with a much more interesting sentence
relating Obama to the elephant,
even though he is not a Republican.
So this is reasoning for captioning;
it is maybe just a first attempt, but
if we could do it in general,
we could generate much more engaging content.
Okay, in the final part I will introduce
my most recent ICCV paper, which is about how
we can better jointly model vision and language
through a new type of task, visual question segmentation.
The problem we want to address is the following:
we have an image and a question about it.
I think most of you are very familiar with the traditional
visual question answering setting: you get a question about
an image, and you generate an answer about the image.
People argue that producing a correct answer
is a very important step,
because to answer the question you have to both understand
the image content and also process the natural language.
So it is a very significant step for
jointly modeling language and images.
it's not enough to understanding the image content here.
So we are gonna insert location text,
we should also provide the visual items or
the visual segment to help you answer this question.
So for example it's the second image box,
it's a straight assembly, if you just answer yes or
no actually you have 50% to get the right answer right?
But no one know that you fully understanding you made it for
them, so you found there are total parts here.
And as a variant there,
you also do not know that if the answer is yes or answer is no.
You don't know that machine.
Definitely no words are needed if we can
localize answer another visual,
definitely understanding that you made content.
And so, we propose that, yeah.
>> So, I've always had a little bit of trouble
with the visual question answering
problem, with the standard setting where you
provide the dataset, you provide the question, and then
you ask the system to generate the answer.
Isn't it even more important to
be able to see the- >> Yeah, yeah.
>> The question.
This kind of question setting, for example-
>> For example, for some questions, if you just rely on the
language prior, 90 percent of the answers will be correct.
I think this language bias is another argument for why we
want the visual evidence: if we only rely on the language prior
between question and answer, we already get reasonably good
performance, and it does not improve much when you actually
look at the images.
So I decided that our setting should be
different from standard question answering,
which ignores much of the visual property of the images.
I think, concurrently with our work, there
is another work where they try very hard,
I mean, to balance the dataset so that the language prior alone
cannot generate the right answer.
But we approach it in a different way:
we also look at the visual evidence beside the answer.
I think it is another very natural way to evaluate, and it
highlights the importance of image content analysis in the task.
With this data,
we show that it can benefit two fundamental tasks.
The first task we show is that we can enable a new type of
question-focused semantic segmentation:
given the question, you segment the regions you need
in order to answer it.
The reason this task is interesting is that questions can
include many kinds of
clues that unify different image tasks: what the object is,
where it is, and so on.
"What is this object" is more like the object detection problem
in computer vision, and a "how many" question is more
like semantic or instance segmentation.
"What is the man doing" is action recognition in
images, and some questions
also require reasoning.
For "is he married,"
you can localize the ring, which indicates that he is married.
Or for a question about where he is shopping, you might
localize the McDonald's sign.
So this is the first task.
Second, we also show that, with these segmentation masks,
we can do supervised attention for VQA.
Currently, attention during training is
learned more like a black box:
the machine automatically attends to different regions.
And there are many papers showing that
the learned attention is not very reliable.
Here we show that if we use
the question, the region annotation, and
the answer as supervision, we can teach the machine where to
look when we train the model, and
this also significantly improves the performance.
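Here is a minimal sketch of one plausible way to apply that supervision: turn the annotated segmentation mask into a target distribution over image regions and add a KL term on the model's attention map next to the usual answer loss. This is an assumption about the general recipe, not the exact loss in the paper.

import torch
import torch.nn.functional as F


def attention_supervision_loss(attn_logits, seg_mask, eps=1e-8):
    """KL divergence between predicted attention and the mask-derived target.

    attn_logits: (batch, H*W) unnormalized attention over image regions
    seg_mask:    (batch, H, W) binary mask of regions that answer the question
    """
    target = seg_mask.flatten(1).float()
    target = target / (target.sum(dim=1, keepdim=True) + eps)  # mask -> distribution
    log_attn = F.log_softmax(attn_logits, dim=1)
    return F.kl_div(log_attn, target, reduction="batchmean")


# During training this term would be added to the answer-classification loss,
# e.g. total = answer_loss + lambda_attn * attention_supervision_loss(...)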
Here I want to show you some results.
First, I will show the results on the segmentation task.
From the first table, I want to point out that the best method
we propose is still far from a satisfactory
result.
This means that pixel-level image understanding
still has a long way to go and is still very, very hard;
we need more effort in this domain. In the second part,
we show that even though we only annotated about 10 percent
of the original VQA dataset with segmentation masks, we
can already improve the VQA performance by supervising the
attention. This means that, if we can teach a machine to see
clearly,
it will do better at reasoning and at answering the questions.
So I think it is a good start: teach the machine where to look,
so that it
first understands the image content in the correct way. Yeah?
>> When you analyze the query,
the textual query,
do you take in the entire question as a single entity?
Or do you break it up into sub-queries and do different
image processing tasks based on what the sub-query is?
>> [INAUDIBLE]
[CROSSTALK] >> For example,
if I ask "where are the sheep in the picture"
versus "where are the sheep and the cows" in a single picture,
would you break that up into two queries, or would the machine
try to handle that as a single query with two parts?
>> I think it's the second one.
We treat it as a single query: we encode the words, for
example with word embeddings for each word, and average them.
We also tried encoding the sentence with a single
encoder, but we do not decompose the query into parts.
But I agree with you that this
is still an open problem: how to better
understand the language structure is, I think, worth doing,
rather than just
encoding the whole query as one vector.
>> The way that you might attack this as a database problem:
you sort of generate operators from the text, and
you sort of create almost like a query plan through
the decomposition, so you could think of an image
processing plan that is formulated the same way.
>> Yeah, I think that definitely needs more research.
There are many things we can do on the language part, and
also many things we can do on the image part, so that's why
I say it's a very interesting time, and anyone can try this.
[LAUGH] Yeah.
Okay, I think, okay.
>> So you mentioned that you can reason about, say,
how many instances there are;
is it also possible to measure dimensions in the image?
So, for example,
what is the angle of the left part, things like that?
Is that something that's part of the dataset?
>> So I think your question relates to answer ambiguity.
For example, you ask a question about this image:
what is the natural region to return?
Sometimes the question mentions the tower,
but the answer is a person.
So true, yeah:
when we labelled this dataset, we definitely found
many ambiguous cases like that.
So we defined the rule that
we only let the annotators segment the answer part, instead of
the question part,
to avoid exactly
this confusion.
And also, during the labelling task, for some questions
you cannot define a particular region as
the answer.
So we also gave the annotators some options: they may say
that you need to see the whole image to answer,
or that they are unsure, and we combine these cases.
So we guided the annotation with such rules, because
we want the dataset to be cleaner, and
we focus more on questions where the region and the
relationships are well defined.
So we deleted the ambiguous cases
during the data collection.
Yeah.
Okay, I think I should conclude my talk. The first thing I
want to highlight is that my work is one of the first to
successfully apply deep learning to video classification.
We also pioneered a new research direction on how we
can utilize web videos and images to do video recognition.
We also proposed a new method that connects knowledge bases
to open-vocabulary, zero-shot video retrieval, and we
took it a step further to stylized video captioning.
And finally, we collected a new visual question segmentation
dataset
to facilitate better joint modeling of vision
and language.
For the future, I am very interested in the following
directions. The
first one is about using meta-learning to design
neural network architectures for video analysis.
Currently, you know, the design of
neural network architectures still relies heavily on human
expertise, and we want to work on this more globally:
how can we use one network to improve another network,
that is, learn to design neural networks with neural networks?
And I think it is interesting because currently there
is a misconception
that to do deep learning
you only need to feed in the data;
if the architecture could also come out automatically, that
would be very cool.
Secondly, I also want to do more work on deeper video
understanding, such as story-level understanding, because
current video understanding is
still limited to recognizing labels, and we should move toward
understanding what is actually contained in the video content
and what the causal relationships are.
Thirdly, I also want to understand the connection
between humans, video, and deeper reasoning,
I mean a deeper understanding with the machine.
For example, after we watch a video, how can we imitate
the action performed in the video,
like learning to perform the action from the video?
It is also very interesting to use generative
models for videos, for
example for future prediction.
And then, lastly, I also want to treat video as
a knowledge base: how can we leverage the
temporal continuity in video to learn with
little or no supervision,
for example for prior learning or unsupervised learning?
I think there is a lot of work we can explore here.
Okay, I think that's all for my talk.
I'm very happy you came here to listen to my talk.
>> [APPLAUSE] >> We're almost out of time, but
in case you have any questions, it seems we can still take a few.
>> So when you choose which images to let through for
retraining, do you have a different parameter setting?
>> Parameter setting, for which part?
Do you mean- >> In the Lead-Exceed part.
>> What?
>> Lead-Exceed?
>> Maybe you can repeat?
Yeah, so- >> The webly-supervised approach?
>> No, when we filter them,
>> Yeah, when you do that-
>> we do
not retrain the network from scratch.
We just put the selected data in to further fine-tune it.
>> Yeah, yeah, yeah, when you're retraining the-
>> They're-
>> When you do that,
are you choosing the images where you have the most
confidence?
>> Yes. >> And if you had to reject some,
would you use the same confidence threshold for every batch?
>> Yeah, we don't consider that
in our case.
We just use the threshold to remove the noise here.
>> Okay. >> So of all the work
that you have presented, which one do you think is closest to
taking out of the lab- >> What?
>> To the real world?
>> Sorry, can you say it again?
>> Which particular work would be ready to ship, be-
>> To ship as a product?
Yeah, I think the video recognition work is
already used in the Google search engine
[INAUDIBLE];
a lot of the video tagging
uses my network.
>> I see.
>> And also the webly-supervised learning, and
also the work on video [INAUDIBLE].
>> [INAUDIBLE]
>> Okay. >> [INAUDIBLE]